ABSTRACT

Speech processing schemes which result in a reduced transmission
bandwidth for voice communications have been the subject of intensive
investigation in recent years. This paper describes a new speech
analysis-synthesis scheme for bandwidth reduction. The speech analyzer
develops seven analogue control signals from the speech signal. These
control signals require a total bandwidth of approximately 140 cps for
transmission to the synthesizer, which utilizes the control signals to
continuously synthesize artificial speech.

1. Naval Tactical Communications System.

The exchange of tactical information within operational units of
the Naval Establishment has for many years been centered around voice
communications. But just as the manner in which warfare is conducted
changes, so must the means by which operational information is
exchanged. There is one basic criterion by which the means of exchange
for information of this type must ultimately be judged: Does the means
operate as an enhancement of, or as a constraint on, the current manner
in which warfare is conducted? It is of paramount importance that the
means of communication in no way restrict naval tactics or the full use
of current naval weaponry. The tremendous scope of naval warfare, the
extreme destructiveness and speed of current weapons and their means of
delivery, and the versatility, flexibility, and mobility that these
demand of naval tactics place requirements on operational
communications which are of the most stringent and severe character.

The inadequacy of today's voice communication system in meeting the
demands for a tactical information exchange medium has been obvious for
some time. Voice communication information exchange rates are completely
insufficient to cope with the problems of modern-day air defense. The
extreme bandwidth requirements of voice communication have long since
led to an unfulfilled demand for tactical communication channels. The
acute shortage of frequency spectrum caused by the use of extremely wide
bandwidth channels is a problem which must be solved. The advent of the
various Tactical Data Systems has been a direct consequence of this voice
communication inadequacy. And with the impact of the Tactical Data System
upon the naval communication scene, a re-evaluation of voice
communications is inevitable.

Consider the scope of operations in which the Navy must perform. The
Navy is involved in air, sea, underwater, and assault landing operations.
The Navy is concerned with guided missile submarine operations,
hunter-killer antisubmarine operations, fast carrier operations with air
attack capabilities, assault landings across defended beaches using the
concept of air envelopment, air defense against both guided missiles and
manned aircraft, and a myriad of other operations. The Naval
Establishment does and must have some capability in every type of warfare
known to man. The Navy must be able to conduct all of these operations
anywhere in the world, not from fixed but from highly mobile bases, and
in an extremely short amount of time.

Dispersion of naval forces became a necessity with the advent of
thermonuclear devices. High-speed aircraft and missiles have made the
reaction time for both offensive and defensive operations critically
short.

What then are the demands today upon a naval tactical communication
system? The system must handle tremendous amounts of varied information.
It must handle this information quickly and reliably over far greater
distances than ever before. It must do all this while operating under a
very serious constraint. That constraint is the limited electromagnetic
frequency spectrum available to naval forces.

The Tactical Data Systems are a great step toward the fulfillment
of these demands. But no data system complex can handle more situations
than those for which it is built. Data system complexes are built to
handle a given number of situations. If an enemy conducts his military
operations in such a way that they are not among the given number of
situations, then other means must be available for information exchange
on his operations.

Consider a data system complex which might be created to handle a
combined Navy-Marine assault across a defended beach. No complex could
be  created  to  handle  the  information  connected  with  every  eventuality, 
every  variation  that  the  operation  might  take.  True,  a  complex  can  be 
created  to  handle  a  great  deal  of  the  information  connected  with  an 
assault  landing.  But  it  is  impossible  to  categorize  or  even  know  every 
bit  of  information  that  might  have  an  exchange  requirement.  And  if  every 
variation  is  not  known,  then  the  system  cannot  be  designed  to  handle  it. 
This  philosophy  is  equally  applicable  to  HUK  operations,  ASW,  air  defense, 
etc. 

It appears evident that every data system complex must have associated
with it some means for handling what might loosely be called the
unexpected variations of warfare. This flexibility in the Tactical
Communication System is deemed to be extremely critical. No potential
enemy can be considered so unprofessional as not to take immediate
advantage of any lack of flexibility in our communication system. It is
believed that voice communication as a mode of information exchange
provides the most flexible communications capability.

Is  then  a  tactical  naval  communication  system  to  be  burdened  not 
only  with  the  prodigious  number  of  voice  nets  now  required,  but  also  a 
number  of  data  system  complexes?  This  writer  considers  the  answer  to  be 
in  essence,  yes. 

In  brief,  the  observations  made  thus  far  are: 

1. Voice communication, as known in the Naval Establishment today,
is no longer adequate to serve as the primary means of tactical
information exchange.

2. Data system complexes are replacing voice communications as
the primary medium for exchange.

3. Design limitations on data system complexes and the vital
requirement of communication flexibility require that there be associated
with data system complexes a communication mode possessing great
flexibility.

4. Voice communications possess great flexibility.

5. There exists an extremely critical shortage of available
frequency spectrum.

6. Occupancy of additional frequency space by data system complexes
makes the communication picture completely untenable.

From these observations, it may be concluded that tactical
communications will be carried out by data system complexes which will be
supplemented by voice communications, and that the transmission of voice
signals must be accomplished using very much narrower bandwidths than are
now occupied.


2. Criteria for Voice Communications Systems.

This paper is an investigation of a newly conceived speech processing
technique and the development of circuitry to achieve the required
speech processing. The particular line taken by the investigation and
the goals aimed at are based upon a set of criteria which are considered
to be applicable to a military voice communication system. The role
played by the voice communication system is considered to be that of a
supplement to data system complexes and an integral part of an overall
tactical communication system.

First, the required bandwidth for the voice channel must be as small
as possible, subject to other considerations. The intelligibility of the
system must be firmly based upon individual word recognition by the human
receiver at the output end. High intelligibility scores on connected
text are not considered adequate, for in connected text the mind has
the unique ability to fill in isolated, unrecognized words based on the
line of thought of the text. A major part of naval voice traffic consists
of prowords, individual code words, and in general unconnected text where
the absolute recognition of words is essential. Word recognition is a
basic requirement and as such acts as a constraint on the level of
bandwidth compression achievable. Bandwidth compression involving
compression in the time domain possesses undesirable attributes. Systems
of this type involve time delays. Although the delays involved are
usually small, it is felt that in an era of Mach two or three aircraft, a
voice system which has no time delay between the input to a voice channel
and the output is preferable. It was felt that the investigation should
thus proceed into "no delay" systems.


Speech processing adds additional components to a conventional voice
transmission system. In one direction speech must proceed through a
speech analysis component, through a transmitting device, a receiving
device, and a speech synthesizer. The actual speech analysis and
synthesis devices may be identical for all military services. These
devices should also be compatible with any system of transmission or
modulation scheme. Thus, the speech processing units should work equally
well whether the voice information is sent via SSB, AM, or FM, by delta
modulation, by frequency multiplexing, or by schemes of a digital nature.

Digital transmission of the speech information possesses qualities
that are desirable. These qualities are increased range, improved
reliability, and inherent security. Classified techniques of digital
transmission offer even more attractive qualities. In practice, digital
transmission has the disadvantage of requiring more bandwidth than is
encompassed by the sampled wave itself.
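
The bandwidth penalty of digital transmission can be illustrated with a
rough calculation. The sketch below assumes plain pulse-code modulation,
sampling at twice the highest signal frequency, and a binary pulse train
whose idealized minimum bandwidth is half the bit rate. The 3000 cps
channel and the 7-bit quantization are illustrative figures only and are
not taken from this investigation.

```python
def pcm_bit_rate(signal_bandwidth_cps, bits_per_sample):
    """Bit rate of a PCM channel sampled at twice the signal bandwidth."""
    sampling_rate = 2 * signal_bandwidth_cps   # samples per second
    return sampling_rate * bits_per_sample     # bits per second


def min_binary_bandwidth_cps(bit_rate):
    """Idealized minimum baseband bandwidth of a binary pulse train."""
    return bit_rate / 2


# Illustrative figures only (assumed, not from the paper).
analogue_bandwidth = 3000   # cps
bits = 7                    # bits per sample

rate = pcm_bit_rate(analogue_bandwidth, bits)
digital_bandwidth = min_binary_bandwidth_cps(rate)
expansion = digital_bandwidth / analogue_bandwidth

print(f"bit rate:          {rate} bits/s")
print(f"digital bandwidth: {digital_bandwidth:.0f} cps")
print(f"expansion factor:  {expansion:.1f}")
```

Under these assumptions the digital channel occupies several times the
bandwidth of the wave being sampled, which is the disadvantage noted
above.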

It  is  felt  that  the  modulation  scheme  to  be  used  in  transmitting  the 
speech  information  is  properly  the  subject  of  a  full  investigation  itself, 
and  is  beyond  the  scope  of  the  current  investigation  into  the  processing 
of  speech. 

Weight considerations are of the utmost importance in the development
of the speech processing devices. Inasmuch as these devices are
additional equipment that must be carried by aircraft, etc., an
extrapolation into the future state of the electronic art was made such
that a sizable weight reduction over the equipment developed during this
investigation should be realizable within one to two years.

The ultimate speech processing technique used in a tactical voice
communication system should provide a level of security over that which
may be obtained from the modulation scheme. The particular
information-bearing signals at the output of a speech synthesizer should
be of such a character that a compromise of the channel depends not only
on a complete knowledge of the modulation scheme, but also on the exact
role played by the information-bearing signals in the processing scheme.

A question that must be considered is whether the speech processing
scheme should be of such a character as to permit individual voice
recognition. The degree of bandwidth compression obtainable in speech
processing is a direct function of this speaker recognition level.

Several factors must be considered. It is a known fact that it is
possible to determine individual ship location and movement from the
recognition of CW operators by their particular traits. Inasmuch as the
number of voice communicators is reasonably small, speaker recognition
provides an easy means for ship recognition. A degree of security is thus
provided by having a system in which all voices sound alike.

Contrariwise, with non-recognition, it is impossible to tell an enemy
voice from a friendly voice. This is not felt to be a strong
counter-argument, for even with speaker recognition it cannot be expected
that enemy voices will necessarily sound different. It is felt that
authentication techniques will provide the desired security. Also, higher
degrees of bandwidth compression are attainable with speaker
non-recognition. A system having no speaker recognition is believed to be
more desirable because of its greater advantages.

Another feature which should be included in any voice communication
system is relative silence at the terminal end of the system between
words. In clipped speech systems, for instance, noise between words
generates zero crossings, with the result that the output in the absence
of speech is very noisy.

In  conclusion,  the  guideposts  for  this  investigation  and  the  criteria 
which  are  believed  to  form  a  basis  for  a  military  voice  communication 
system  are: 

1. Minimum bandwidth occupancy per voice channel.

2. Word recognition.

3. No time delay from speaker to receiver.

4. Compatibility of the speech processor with any mode of
transmission or modulation scheme.

5. Minimum weight, and thus circuit simplicity.

6. A level of security derived from the speech processing itself.

7. Speaker non-recognition.


3. Speech Parameters and Phenomena.

A  survey  of  the  literature  in  the  field  of  Speech  Processing  shows 
that much and yet little has been done. Organized scientific
investigation of any magnitude in this field has been restricted in time to the
last  20  years.  This  upsurge  of  research  and  investigation  has  been  the 
direct  result  of  need:  the  need  to  meet  the  increasing  demands  upon 
communication  services  imposed  by  both  civilians  and  the  military;  the 
need  to  find  an  economy  in  the  means  of  exchange  of  voiced  information 
by  electronic  devices.  An  economy  is  needed  that  is  both  an  economy  of 
channel  bandwidth  and  equipment.  The  inefficiencies  involved  in  the 
current  electronic  means  of  exchanging  voiced  information  by  transmitting 
a  replica  of  the  speech  waveform  have  long  been  common  knowledge  to  the 
communication  engineer. 

The field of human communication is an extremely broad one.
Investigations in this field have been carried out by the psychologist,
the acoustic engineer, the linguist, the phonologist, and experts in the
field of communication and information theory. Common to all these lines
of investigation is the vast lack of knowledge of the mechanism by which
the human perceives speech. This is the basic and unsolved problem of
human communication.

The human perception mechanism is a completely astounding,
fascinating, and little understood thing. The means by which a human is
able to classify many diverse physical stimuli into the same category is
an area of colossal ignorance. In the case of auditory recognition the
same words spoken by a man and a woman are drastically different in their
acoustic content, and yet the listener has little difficulty in
establishing that they are the same word. The speech waveform for a
spoken word varies from person to person and even varies with time for a
given person. The accents of various speakers and the emotional frame of
the speaker all lead to an endless variety of waveforms for the same
spoken word. Yet the listener is able to correctly classify the word. The
mechanism by which this auditory recognition is continuously carried out
in the face of non-speechlike acoustic stimuli (wind noises, machinery
noises, and other environmental sounds) is little understood at the
present time.

The endeavors of the various types of investigators in the field of
human communication have led to an array of hints and clues about the
auditory recognition mechanism. A great number of phenomena concerning
speech and its perception have been observed and reported. But all of
the acquired knowledge has not led to such a level of understanding that
the communication engineer may analytically design an efficient means for
electronically exchanging voiced information.

The communication engineer today is attempting to solve two closely
allied problems: the problem of efficiently communicating between men,
and the problem of direct voice communication between man and machine.
Communication between man and his machines is at present confounding some
of the best scientists in the world. Progress in this area has been
difficult and the results meager. Communication between men, with regard
to required bandwidths, reliability, etc., has progressed almost as
slowly as man-machine communication, with slightly better results.

The processing of speech to achieve the aforementioned economies in
the electronic exchange of speech information between men is the problem
of the communication engineer. These engineers, utilizing the hints and
clues provided by allied investigators in the field of human
communication and taking cognizance of the reported phenomena and
hypotheses, have achieved a certain level of success in providing devices
to meet the demanded economies. One of the first of these devices
developed, and perhaps the best known, is the Vocoder as developed by
Dudley.

The activities of the communication engineer in the area of man to
man communication have been and are device stimulated. The goal has been
to develop a means and a device to achieve bandwidth compression and
increased reliability without a complete knowledge of human perception
and communication. But really, the entire field of electronics is
characterized by this type of thing. Awe-inspiring progress was made by
scientists who had little or no knowledge of the electron or how it
performed. As a result of this viewpoint, research in the speech
processing field has been and is along non-analytic lines. What must be
said is that we do not know enough about the field to be analytic.

Before considering the particular investigation presented in this
paper, it is necessary to discuss briefly the speech production mechanism
and the various hints, clues, and reported phenomena about human
communication available to the researcher in the field of speech
processing.

The process of speech production may be regarded as similar to that
of a carrier system in which the modulation of a vocal cord tone or wide
band fricative noise is effected by the movement of tongue, lips, jaws,
and other parts of the articulation mechanism, and by the resonant
qualities of nasal, mouth, and throat cavities [2]. The lungs supply to
the larynx and its associated vocal folds the breath stream which is the
driving force for the system. The current theory, as discussed by
Stetson, is that the lungs do not supply the vocal mechanism with air at
constant pressure during speech but in a pulsating manner so as to aid in
syllable production. Of course, if a given speech sound is maintained for
a long period of time, such as is encountered when the sound is sung,
then the air is supplied at a constant pressure.

The breath stream is constituted of a vast number of turbulent
motions, each of minute energy, and so the driving force for the vocal
cords is an acoustic spectrum of uniform energy. The vocal cords,
operating on the breath stream, determine which of the two basic types of
acoustic excitation is presented to the upper vocal organs for
modulation. If the vocal folds remain in a fixed open position, such as
does occur for fricative sounds, then the breath stream passes through
the glottis (the space between the vocal folds) to be modulated by the
resonant cavities of the upper vocal tract, the nasal and mouth cavities,
and the teeth. The modulation of the uniform energy breath stream by
these upper vocal organs results in a reinforcement of certain broad
frequency regions within the spectrum of the breath stream. The sounds
produced by this turbulent excitation are usually referred to as unvoiced
sounds. The fricative "s" is a sound produced by turbulent excitation.
Spectral analysis has shown that the areas of reinforcement are in
general above 3000 cps for sounds produced in this manner.

The second type of acoustic excitation is produced when the vocal
cords, or folds as they are more correctly called, do not remain in the
fixed open position but open and close periodically. The larynx contains
the vocal folds and the associated muscles for controlling the mode of
operation of the vocal folds. The larynx may be divided into three areas:
1. the subglottic cavity; 2. the space between the vocal folds, the
glottis; and 3. the supraglottic cavity. The subglottic cavity operates
to concentrate the breath stream toward the glottis. The primary
laryngeal tone is produced at the glottis for voiced sounds, while the
supraglottic cavity commences to form the timbre of the voice. The
classical aerodynamic theory of phonation describing the mode of vocal
fold vibration has in recent years become accepted. This "air puff" or
"air burst" theory describes the sequence of vocal fold vibration as
follows: 1. closure of the glottis; 2. accumulation of subglottic
pressure; 3. explosion of the closed vocal folds and the escape of an air
puff or burst through the opened glottis; 4. relaxation of the folds to
the closed position; and 5. repetition of the cycle. The resulting
pressure waveform at the upper end of the larynx is a rough asymmetrical
sawtooth very rich in harmonics. The Fourier line spectrum produced is
not one of uniform energy. The lower harmonics contain most of the
energy. As the harmonic number increases the associated energy decreases.
The periodicity of the vocal fold burst is determined by the tension of
the vocal folds.
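
The fall-off of energy with harmonic number can be made concrete with a
short calculation. The sketch below assumes an ideal sawtooth, whose n-th
harmonic has an amplitude proportional to 1/n; the actual larynx wave is
only roughly a sawtooth, so the figures are indicative and the function
name is purely illustrative.

```python
import math


def sawtooth_harmonic_levels_db(num_harmonics):
    """Level of each harmonic relative to the fundamental, in decibels.

    Assumes an ideal sawtooth, whose n-th harmonic amplitude is
    proportional to 1/n. The real larynx wave is only roughly a sawtooth.
    """
    return [20 * math.log10(1 / n) for n in range(1, num_harmonics + 1)]


levels = sawtooth_harmonic_levels_db(8)
for n, level in enumerate(levels, start=1):
    print(f"harmonic {n}: {level:6.1f} dB")
```

Under this assumption each doubling of the harmonic number lowers the
level by about 6 dB, in keeping with the observation that the lower
harmonics carry most of the energy.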

The Fourier line spectrum at the larynx during phonation is modulated
by the upper vocal organs and cavities such that certain harmonics are
attenuated and others are reinforced. Particular frequency regions in
the spectrum which are reinforced more strongly are called formants. The
sounds produced by this harmonic excitation are called voiced sounds. The
vowels are all voiced sounds. In general, there are three formants which
occur during voiced sounds. These formants usually occur within the
following frequency regions:

F1    270 to  730 cps

F2    840 to 2230 cps

F3   2240 to 3010 cps

The frequency corresponding to the repetition rate of the vocal fold
burst is the fundamental of the Fourier series. The frequency
corresponding to the pitch as heard by the listener is in most cases the
fundamental frequency of the Fourier series. In other cases the pitch
frequency may be the second or third harmonic frequency [7]. Pitch
phenomena and the extraction of the pitch frequency from speech by speech
analyzers have plagued investigators for many years. Inasmuch as the
method of pitch extraction developed and utilized in this investigation
is unique, a fuller discussion of pitch will be delayed until Section 5,
in which the conceptual details of the investigation conducted will be
presented.

Figure 1 shows the waveform of the larynx source for voiced sounds.
Figure 2 shows the approximate spectrum of the larynx source energy for a
voiced sound. Figure 3 shows a typical speech waveform for voiced sounds,
and Figure 4 shows the Fourier line spectrum for the wave. The three
formants are easily distinguished. It should be noted that the larynx
harmonics may or may not lie exactly at the same frequency as the peaks
of the formants.
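
The last point, that larynx harmonics need not coincide with formant
peaks, follows from the harmonics being integer multiples of the
fundamental. The sketch below uses a hypothetical 120 cps fundamental and
hypothetical formant peaks chosen to lie inside the three regions listed
earlier; none of these values comes from the figures.

```python
def nearest_harmonic(fundamental_cps, formant_peak_cps):
    """Return (harmonic number, harmonic frequency) closest to a peak."""
    n = max(1, round(formant_peak_cps / fundamental_cps))
    return n, n * fundamental_cps


# Hypothetical values for illustration only.
fundamental = 120             # cps, larynx repetition rate
peaks = [500, 1700, 2500]     # cps, one peak in each formant region

for peak in peaks:
    n, freq = nearest_harmonic(fundamental, peak)
    print(f"peak {peak} cps: nearest harmonic {n} at {freq} cps "
          f"(offset {freq - peak:+d} cps)")
```

With these values every formant peak falls between two harmonics, so the
line spectrum samples the formant envelope near, but not at, its maxima.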

The starting point for all electronic communication systems whose
function is to provide a means for the exchange of spoken information is
the acoustic pressure wave generated at the lips of a speaker.
Communication engineers working in voice communications have conducted
analyses of speech in both the time and frequency domains. The results of
these investigations have shown that while the analyzed speech of an
individual speaker is directly correlatable to the operation of his vocal
organs, the correlation between the observed phenomena in the time and
frequency domains for different speakers is far from satisfactory. Sixty
persons may say the vowel "a" and the associated pitch of the sound may
be different for all. An important part of the vowel sounds is the
position of the formants. The formants for a given sound shift up and
down in the frequency domain depending on whether the speaker is male or
female. Unfortunately, the formants do not keep the same relative
positions as they shift around. Formant positions for a given sound for a
given speaker also are not always the same. In general, the acoustic
stimuli for a given sound and for speech vary from speaker to speaker and
for a given speaker with time.

A typical long-time average of the voice spectrum is shown in Figure
5. A consideration of this curve shows that almost all of the power of
speech is below 6000 cps. As a result, speech processing techniques have
dealt with speech as if it were bandlimited to an upper value of 6000
cps. The telephone system has shown that a high degree of intelligibility
results when only one half this amount of bandwidth is considered. The
effect of cutting off high and low frequencies on the articulation of
different classes of speech has been investigated by Steinberg. Figures
6, 7, and 8 show some of the results of his investigation. From these
curves it appears that frequencies below 400 cps and frequencies above
6000 cps can be removed with little effect upon articulation.

Speech  communication  may  be  likened  to  a  black  box.  The  input  to  the 
box  is  the  speech  wave.  At  the  output  is  the  information  perceived  by  the 
human  sensor.  Inside  the  black  box  is  the  auditory  perception  mechanism 
about  which  little  is  known.  The  goal  of  speech  processing  is  to  reduce 
the  data  in  the  speech  wave  by  some  scheme,  present  this  reduced  data  to 
the  input  of  the  black  box  and  have  the  human  sensor  perceive  the  same 
intelligence  from  the  reduced  data  as  he  would  if  the  input  wave  were  the 
original  speech  wave.  For  this  investigation  the  intelligence  perceived 
has been defined to exclude such information as: 1. emotional status of
the speaker; and 2. speaker recognition.

Experimentation on the inputs to the black box and observation of
the intelligence perceived have led to hints and clues about the nature
of the auditory recognition mechanism. First of all, the auditory
recognition mechanism is not a constant parameter mechanism. If one
doubles the frequency of a pure tone, the pitch perceived by a listener
is not twice as high. Work by Stevens and Volkmann [9] has led to the
establishment of a pitch scale which relates the sensation caused by a
frequency to the frequency producing it. Figure 9 shows the relation
between pitch in mels and frequency. Similarly, the relationship between
intensity and loudness of the acoustic stimuli is non-linear.
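
The non-linear relation between frequency and pitch can be illustrated
numerically. The sketch below uses a widely quoted later analytic
approximation to the mel scale, not the measured curve of Figure 9; the
formula and its constants are therefore an assumption made for
illustration.

```python
import math


def hz_to_mel(frequency_cps):
    """Approximate pitch in mels for a pure tone.

    Uses a later analytic fit (an assumption here), not the tabulated
    scale discussed in the text.
    """
    return 2595.0 * math.log10(1.0 + frequency_cps / 700.0)


for f in (500, 1000, 2000, 4000):
    print(f"{f:5d} cps -> {hz_to_mel(f):7.1f} mels")

# Doubling the frequency from 1000 to 2000 cps.
ratio = hz_to_mel(2000) / hz_to_mel(1000)
print(f"pitch ratio for a doubled frequency: {ratio:.2f}")
```

Under this approximation a 1000 cps tone lies near 1000 mels, and doubling
its frequency raises the pitch by a factor of roughly one and a half
rather than two.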

The human sensor frequently supplies information for which there
appears to be no stimulus in the physical signal. If a listener is
presented with a pure tone, he may report he also hears the harmonics
[10]. In fact, if an auxiliary oscillator is introduced at a frequency
three or more times the original tone, listeners also report they hear a
beat frequency with one of the aural harmonics. The pitch heard from the
original tone may also be varied by changing the stimulus time. If a
listener hears a tone for 20 milliseconds, he will report that the pitch
is lower than if he heard the same tone for five seconds. The shortest
note which sets up any sensation of pitch has a duration of approximately
10 to 20 milliseconds [4].

The human sensor is also capable of supplying the fundamental if only
the harmonics are given. The frequencies 2000, 2200, and 2400 cps will
separately cause perception of pure tones with a pitch of 2000, 2200, and
2400 cps. Together they will lead to the perception of a sharp sound with
a pitch of 200 cps [7].
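The arithmetic behind this example can be sketched in a few lines. The
sketch below assumes only that the perceived pitch corresponds to the
largest frequency of which every presented component is a whole multiple;
it is an arithmetic illustration, not a model of the ear, and the function
name implied_fundamental is introduced here for illustration only.

```python
from functools import reduce
from math import gcd


def implied_fundamental(components_cps):
    """Return the largest frequency that divides every component exactly."""
    return reduce(gcd, components_cps)


# The three tones from the text, in cps.
print(implied_fundamental([2000, 2200, 2400]))  # 200
```

The three tones are the 10th, 11th, and 12th harmonics of 200 cps, which is
the pitch reported by listeners when the tones are presented together.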

Ohm's law of hearing is frequently quoted as though the ear were
absolutely insensitive to phase; this is not the case. The ear is
relatively insensitive to phase, and the phase angle may be varied over
fairly wide limits, but an extremely wide variation will cause a change
in the sensation perceived by the listener. The ear is sensitive to the
number and amplitude of harmonics, though; the quality of the sensation
depends markedly upon the spectrum of the sound.

One of the most important characteristics of speech is its great
redundancy. Speech processing techniques in general remove vast
quantities of the information contained in speech and yet the artificially
reconstructed speech is intelligible. The high degree of redundancy
contained in speech has been demonstrated by a number of experiments.
Licklider has shown that up to 75% of the speech waveform, in the time
domain, may be removed with practically no deterioration in
intelligibility. Consider the high degree of redundancy involved if one
can throw away 75% of the speech waveform and still have intelligibility.
The success of the well known clipped speech systems, in which the only
information extracted from speech by the speech processor is the zero
crossings along the time axis, is indeed amazing and further points to
the high degree of redundancy. There are a great number of speech
processing schemes and the operation of each of them depends upon the
great redundancy of speech. The success of these schemes in itself is a
testimonial to this redundancy characteristic.

Another important factor that must be mentioned in connection with
speech processing is that of a priori information. The a priori knowledge
or psychological set of the human sensor with reference to auditory
recognition is another factor which has enabled success in the speech
processing field. The concept of psychological set is still very
hypothetical and little understood today. Generally speaking, the human
sensor appears to possess a psychological set against which the incoming
acoustic stimulus is compared to achieve intelligence and recognition.
This concept is analogous to that in information theory in which we
regard the receipt of signals as providing evidence of the messages
selected at the transmitter, such evidence converting the receiver's
hypothesis concerning the possible messages from an a priori set to an
a posteriori set from which the receiver can make a best guess with a
chance of error. The ability of the human sensor to fill in distorted or
unrecognizable words in connected text has long been common knowledge. A
possible explanation for this phenomenon is that the mind weights certain
members of its psychological set on the basis of the subject being
discussed and selects that member with the highest probability of
occurrence when a word is missed. The tremendous ability of the mind to
derive intelligence from only the barest hint of information is both a
help and a hindrance to the communication engineer. But the help is major
while the hindrance is only minor.
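The conversion from an a priori set to an a posteriori set can be
illustrated with a small calculation. The sketch below assumes that the
weighting of the psychological set behaves like a prior probability over
candidate words and that the acoustic evidence supplies a likelihood for
each candidate; the word list and all numbers are hypothetical and chosen
only for illustration.

```python
def posterior(prior, likelihood):
    """Apply Bayes' rule over a finite set of candidate words."""
    joint = {w: prior[w] * likelihood.get(w, 0.0) for w in prior}
    total = sum(joint.values())
    if total == 0:
        raise ValueError("the evidence rules out every candidate")
    return {w: p / total for w, p in joint.items()}


# Hypothetical weights for a conversation about abstract art.
prior = {"canvas": 0.6, "campus": 0.3, "cactus": 0.1}
# Hypothetical acoustic evidence from a distorted word.
likelihood = {"canvas": 0.2, "campus": 0.3, "cactus": 0.25}

post = posterior(prior, likelihood)
best_guess = max(post, key=post.get)
print(best_guess)  # canvas
```

With these numbers the distorted sound alone slightly favors "campus", but
the weighting supplied by the topic of discussion makes "canvas" the best
guess, which is the behavior the paragraph above attributes to the listener.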

The communication engineer in evaluating a speech processing system
must determine to what factor any success of the system is attributable:
the speech processing scheme itself or the tremendous ability of the human
sensor. If a listener reports a high intelligibility score when connected
text is used to evaluate a speech processing technique, doubt still
remains about the actual performance of the processing scheme itself. A
true evaluation must be based on a test which uses isolated words, one
in which the listener has no chance to "pre-weight" certain members of
his psychological set. A curve showing the relationship between word and
sentence intelligibility is shown in Figure 10.

Before discussing the aid a priori information gives to speech
processing, a few more general remarks about a priori information itself
will be made. The psychological set of the human sensor is a product of
his past environment. It seems fairly clear that the mind must store
information about what words are expected to be connected with some
concept, some idea, some topic of discussion. Similarly, information must
be stored about sentence structure, word groupings, and the set of
expected acoustic stimuli from a given speaker. During conversation the
listener weights certain members and subsets of his psychological set as
determined by the current environment, recognition of the speaker, and
subject matter. Thus, even before the listener hears the speech wave,
certain subsets have been essentially removed from consideration and the
probability of correct recognition is enhanced. In a discussion about
abstract art one certainly does not expect the interjection of a sentence
about the social structure of an ant colony.

Most people at one time or another have talked with some person whose
foreign accent was so thick that initially it was difficult to understand
his words. But after listening to the speaker for some time one notices
that it becomes easier and easier to understand him. The mind lacked a
subset of expected acoustic patterns in this case and had to create a set
before a high level of understanding was achievable. When the listener
again meets this speaker and recognizes him by the sight mechanism, it
appears logical that the listener weights the particular subset for the
speaker and thus achieves more instant aural recognition.

D. B. Fry [12] has presented a demonstration of the manner in which
a priori knowledge bears upon recognition. A phonograph record of the
conversation of two speakers was distorted so that not a word of the
conversation was recognized by a group of listeners. After the record was
played once the listeners were told about the subject of discussion
between the two speakers. When the record was played a second time most
listeners were able to follow the entire conversation.

A priori knowledge may be reasonably expected to be a great aid in
speech processing, for listeners hearing the distorted artificial speech
of a processing scheme can build up a subset of expected sounds and thus
bring about an enhancement of the success of the system. This phenomenon
was observed during the investigation of the particular speech processing
scheme described in this paper. After working with the system for a
period of time it seemed obvious to the investigator that a certain group
of sounds was indeed a certain word. But if other listeners heard the
same group of sounds for the first time, there was an element of doubt in
their recognition. A discussion of speech processing and a priori
information is presented by Cherry [4].

4. Contemporary Speech Processing Systems.

Speech bandwidth reduction systems may in general be grouped into
four principal categories:

1. Time or frequency compression methods.

2. Continuous analysis-synthesis methods.

3. Discrete sound analysis-synthesis methods.

4. Sound group analysis-synthesis methods.

Time or frequency compression systems utilize sampling or frequency
division techniques. The Doppler Frequency Compressor system falls within
this category [13]. One of the important forms of redundancy present in
speech is repetition of the waveshape characteristic of a given sound
during its generation. One could therefore obtain a 2:1 bandwidth
reduction by: 1. sectioning the incoming speech wave into equal time
sections; 2. transmitting only information on alternate sections; 3.
reconstructing speech at the terminal end of the system by double
playbacks of the information received on alternate sections. The Doppler
compression scheme sections the incoming speech wave and then discards
alternate sections. The remaining sections are expanded to twice their
normal time interval, thus filling out the blank time intervals generated
above. The time expansion results in a compression of the frequency range
of the sections to one half of its unexpanded value. The reduced data,
which has information spread continuously along the time axis, is
transmitted to the synthesizer, which time compresses the incoming
expanded sections to their original interval. This action expands the
frequency range to its original limits and produces an alternating
sequence of blank and signal filled intervals. Each signal filled
interval is played twice by the synthesizer, thus obtaining a continuous
output. Experimental results indicate that compression ratios of 1:4 to
1:6 may be achievable by this method. This system operates especially
well with long vowel sounds in which the characteristic waveform is
repeated many times. This scheme must be classed as one in which mild
processing is accomplished, for at the synthesizer the alternate time
compressed sections are exact replicas of the corresponding time
intervals in the incoming speech wave except for spurious noises caused
by the sampling mechanism.
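The sectioning, discarding, and double playback steps of the 2:1 case can
be sketched on a sampled waveform. This is a minimal illustration under
stated assumptions, not the Doppler compressor itself: it works on a list
of samples, the function names analyze and synthesize are introduced here,
and the time expansion used to fit the narrower channel is left out since
it is undone exactly at the synthesizer.

```python
def analyze(samples, section_len):
    """Cut the wave into equal sections and keep sections 0, 2, 4, ..."""
    sections = [
        samples[i:i + section_len]
        for i in range(0, len(samples), section_len)
    ]
    return sections[::2]


def synthesize(kept_sections):
    """Play each received section twice to fill the discarded interval."""
    out = []
    for section in kept_sections:
        out.extend(section)
        out.extend(section)
    return out


# A wave whose shape repeats every 4 samples, like a sustained vowel.
wave = [0, 1, 0, -1] * 4
kept = analyze(wave, section_len=4)
rebuilt = synthesize(kept)
print(rebuilt == wave)  # True
```

When the waveshape repeats from one section to the next, as in a long
vowel, the rebuilt wave matches the original although only half of the
samples were sent. When the waveshape changes between sections the
repeated section is only an approximation of the discarded one.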

David and McDonald have developed another scheme utilizing time
and frequency compression techniques. The techniques involve a pitch
synchronous processing of speech. The feasibility demonstration of this
technique involved two major processing steps, one of which should not be
required in an operational system. In step one a channel vocoder was
used to provide a convenient source of monotone speech for the input to
the pitch synchronous analyzer. The pitch frequency for the monotone
speech was set at and remained at 200 cps during the demonstration. The
procedure of setting the pitch frequency was one of convenience and does
not detract from the demonstrated feasibility of the system. As has been
stated before, during voiced sounds there is a characteristic repetition
of a basic waveform. The function of the pitch synchronous analyzer is
to remove N-1 of these repetitions from the incoming speech and process
the Nth period for transmission. The channel capacity required to
accommodate only the information contained in the Nth period is thus 1/N
of that required for the complete speech signal. The synthesizer
reprocesses the information received on the Nth period to put it into the
proper time and frequency frame, then plays the information once and
repeats it N-1 times. Unfortunately, speech frequently contains sections
which show little or no periodic structure. In the demonstration using
monotone speech as an input to the pitch synchronous processor the
unperiodic sequences were segmented at the same rate as the voiced
portions. In spite of this arbitrary sectioning of the unperiodic sounds,
the resulting articulation was better than expected. In an operational
system the scheme would not use a vocoder to provide monotone speech but
would use the actual pitch frequency as a basis for segmentation. The
treatment of the unvoiced sequences in an operational scheme still
remains an unanswered question. Two proposals for their treatment have
been made: 1. leave the unperiodic sections intact and code the
information using an elastic time base to fit the transmission channel
required for periodic information; and 2. segment the unperiodic sounds
at some arbitrary rate. It is possible that there may be appreciable
variation in the waveform between sampling intervals. In order to
overcome this problem it has been proposed that the system, instead of
repeating the one transmitted sample N-1 times, perform a linear
interpolation at the synthesizer between adjacent transmitted samples;
each such synthesized period is a step in the interpolation sequence.
Experimentally it has been shown that for N as great as 6, using monotone
speech as the input, the processing did not destroy the fundamental
phonemic information.
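The two synthesis options, repetition and linear interpolation, can be
compared on a toy signal. The sketch below assumes the speech has already
been cut into pitch periods of equal length, as with the monotone input
used in the demonstration; the function names and the data are
illustrative and are not taken from the scheme as built.

```python
def analyze(periods, n):
    """Transmit one pitch period out of every n."""
    return periods[::n]


def synthesize_repeat(sent, n):
    """Play each transmitted period n times in all."""
    return [list(p) for p in sent for _ in range(n)]


def synthesize_interpolate(sent, n):
    """Step linearly from each transmitted period to the next one."""
    if not sent:
        return []
    out = []
    for a, b in zip(sent, sent[1:]):
        for k in range(n):
            w = k / n  # 0 at the earlier period, approaching 1 at the later
            out.append([(1 - w) * x + w * y for x, y in zip(a, b)])
    # No later period exists to interpolate toward, so hold the last one.
    out.extend(list(sent[-1]) for _ in range(n))
    return out


# Six hypothetical periods whose waveshape drifts steadily.
periods = [[float(i), 2.0 * i] for i in range(6)]
sent = analyze(periods, n=2)
print(synthesize_repeat(sent, 2)[:2])       # [[0.0, 0.0], [0.0, 0.0]]
print(synthesize_interpolate(sent, 2)[:2])  # [[0.0, 0.0], [1.0, 2.0]]
```

For a waveshape that drifts steadily between transmitted periods, the
interpolated output follows the drift while the repeated output moves in
steps, which is the variation the interpolation proposal is meant to
reduce.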

The continuous analysis and synthesis schemes are those in which a
number of analogue control signals are extracted from speech and
transmitted to a synthesizer where they are used to control the operation
of networks which are functional approximations of the human voice
production mechanism. These control signals are associated with some
parameter of speech and carry information about the activity of this
parameter. For instance, a control signal may be associated with the
amount of energy in a given frequency range of speech. Thus, a high
control signal level is associated with a high energy level in the
particular frequency band. There are a number of parameters of speech
whose rates of change are limited to syllabic rates [2]. The associated
control signals carry information about the magnitude and thus the
variation of these parameters. Since the chosen parameters vary at
syllabic rates, about 15 to 25 cps, each control signal requires a
bandwidth of only 15 to 25 cps for transmission.

The goal of investigation in the continuous analysis-synthesis area
has been and still is to judiciously select or discover slowly varying
parameters of speech, the utilization of which will lead to the
reconstruction of satisfactory artificial speech with a minimum number of
control signals.

There are a great number of continuous analysis-synthesis speech
processing schemes. An adequate review of all of them is beyond the
scope and purpose of this paper. A few of the better known schemes
will be discussed in order to point out current trends in this area and
to serve as a background for the continuous analysis-synthesis scheme
presented in this paper.

The Vocoder is perhaps the prime example of this type of scheme.
In this scheme speech is broken up into a number of contiguous frequency
bands by an analyzer filter bank. The number of channels designates the
type of vocoder: 12 channel vocoder, 18 channel vocoder. A group of
analogue control signals which are associated with the amount of energy
in each of the bands is derived by amplitude detecting the outputs of the
analyzer filter bank. The control signals are transmitted to the
synthesizer where they are used to amplitude modulate a local excitation
function falling in a band corresponding to that from which they were
derived. The local excitation function at the synthesizer is composed of
two types of excitation. One type of excitation is provided for voiced
sounds, another type for unvoiced sounds. A buzz generator whose output
is a harmonic spectrum with the fundamental and harmonics to a high
degree provides excitation for voiced sounds. The fundamental of the buzz
generator is controlled by what is called a pitch control signal. In the
vocoder the pitch control signal is the output of a filter which passes
frequencies from 100 to 300 cps in the speech spectrum. For unvoiced
sounds a hiss generator provides broad band noise excitation. The
switching between energy sources from hiss to buzz is accomplished by the
pitch control signal. When the speech is unvoiced there is no current in
the pitch control channel and a switch in the synthesizer automatically
switches in the hiss generator. A synthesizer filter bank identical to
the analyzer filter bank receives the local excitation and breaks it up
into channels identical frequencywise to the analyzer channels. Each
channel in the synthesizer is then amplitude modulated by the control
signal derived from the corresponding analyzer channel. The modulated
signals from each band are mixed to produce the artificial speech. In
essence, the system monitors only two types of parameters: 1. the energy
in the given frequency bands; and 2. the lowest frequency present in the
spectrum during voiced sounds. The associated energy control signal for
each band sets the energy level for a corresponding band of excitation
produced at the synthesizer.

In general, for satisfactory synthesized speech the number of
channels has been between 10 and 18. The control signals vary at a rate of
approximately 20 cps so that the bandwidth required for this system has
been about 300 to 450 cps.

Stemming from the channel vocoder described above have been the
formant tracking vocoders, two of which will be described.

In the resonance vocoder speech is broken up into four channels:
40 to 400, 300 to 1100, 900 to 3000, and 3000 to 8000 cps. In each of
the three upper channels, two parameters are monitored: 1. the total
energy in the channel; and 2. the average number of zero crossings of
the filtered wave taken over a finite interval. The pitch control signal
is determined from the lowest channel in the same manner as in the channel
vocoder. The energy in the lowest channel is also monitored. The
three upper channels are chosen such that they bracket the frequency
regions in which the first three formants occur. It has been determined
experimentally that the average channel frequency based upon the average
number of zero crossings for each channel is a fairly good approximation
of the formant frequencies F1, F2, and F3 [17]. Two types of excitation are
provided in the synthesizer: buzz and hiss. The pitch control signal
determines the fundamental of the buzz generator. The local excitation
function is sent to three voltage variable resonant filters and to a 400
cps low pass filter. The frequency control signals associated with the
formant channels adjust the center frequencies of the variable filters
such that they correspond to the average frequency of each of the formant
channels. The outputs of the variable filters are amplitude modulated by
the associated energy control signals. The control signal associated with
the pitch channel modulates the output of the 400 cps low pass filter.
The type of excitation, buzz or hiss, is determined by a comparison between
the energy control signals of the 40 to 400 and 3000 to 8000 cps channels
in the synthesizer. If the upper channel contains more energy the hiss
generator is switched in as the local excitation function. The operation
is completed with a mixing of all the modulated outputs in the synthesizer.
It has been determined that for a total bandwidth of approximately 300 cps
fair intelligibility results.
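The relation between zero crossings and average channel frequency can be
checked numerically. The sketch below assumes a sampled signal and uses
the fact that a sinusoid of frequency f crosses zero 2f times per second;
the 700 cps test tone, the sampling rate, and the function name are
illustrative choices and do not come from the resonance vocoder.

```python
import math


def zero_crossing_frequency(samples, sample_rate):
    """Estimate average frequency as half the zero-crossing rate."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    duration = (len(samples) - 1) / sample_rate  # seconds spanned
    return crossings / (2.0 * duration)


# One second of a hypothetical 700 cps tone sampled at 16000 per second.
rate = 16000
tone = [math.sin(2 * math.pi * 700 * n / rate + 0.1) for n in range(rate + 1)]
estimate = zero_crossing_frequency(tone, rate)
print(round(estimate))  # 700
```

For a filtered speech band containing one dominant resonance, the same
count gives the average frequency that the text reports as a fairly good
approximation of the formant frequency.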

The second formant tracking vocoder to be described is a scheme
developed by Howard [17] in which seven parameters of speech are monitored.
The parameters extracted are the first and second formant frequencies,
F1 and F2, their respective amplitudes, AF1 and AF2, the voice pitch P, the
amplitude of the unvoiced turbulent sounds M0, and the centroid of the
turbulent sound spectrum M1. The control signals associated with F1 and F2
are determined by averaging the zero crossings of the outputs of two
voltage variable narrow bandpass filters. The center frequency of each
filter is determined by an auxiliary control signal which has been
developed from the average number of zero crossings at the output of a
fixed filter which brackets the area in the speech spectrum in which the
given formant occurs. The AF1 and AF2 control signals are determined by
envelope demodulating the outputs of the variable filters associated with
the formant frequency control signals. The control signal for M1 is
determined by averaging the zero crossings for the entire speech wave. The
M0 control signal is derived from an envelope demodulation of the entire
speech wave. The turbulent sound control signals are not transmitted to
the synthesizer during voiced sounds. Turbulent sounds are synthesized by
first amplitude modulating the output of a wide band noise generator with
M0 and then selecting out a portion of the noise spectrum with a voltage
variable resonant filter whose center frequency is determined by M1.
Voiced sound synthesis is accomplished by: 1. feeding two voltage variable
tuned filters in parallel with a series of short pulses the frequency of
which is controlled by P; 2. adjusting the center frequencies of the
variable filters with the formant frequency control signals; and 3.
amplitude modulating the outputs of the respective filters with the AF1 and
AF2 control signals. The modulated outputs of the turbulent and voiced
sound synthesizers are mixed in the final step of the processing scheme.

Quantitative results for this scheme have not been published as yet.
The estimated bandwidth for the scheme is approximately 140 cps for fair
intelligibility.

Discrete and sound group analysis-synthesis methods will be treated
together because the basic philosophy of the methods is the same. The
methods differ only in the length of the sound group operated upon. The
philosophy of these methods is to machine recognize a discrete sound unit
and transmit a coded group identifying the unit to the synthesizer for
voice reproduction. Synthesis may be accomplished by a simple readout of
stored sound units from some memory device or a readout of a set of stored
control signals to activate a speech synthesizer such as a vocoder.
Phoneme recognition schemes operate on the sound unit with the smallest
length. There are 40 phonemes utilized in the English language and a
system which is capable of recognizing them would require only a 60
bit/second information rate to convey voiced information. To date there
has been no successful demonstration of a device based upon phonemic
coding.
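The 60 bit/second figure can be related to the size of the phoneme set by
a short calculation. The sketch assumes that the 40 phonemes are coded as
equally likely symbols, which the text does not state; the speaking rate
that results is therefore an inference for illustration and not a figure
taken from the source.

```python
import math

PHONEME_COUNT = 40        # from the text
RATE_BITS_PER_SEC = 60    # from the text

# Bits needed to identify one of 40 equally likely phonemes.
bits_per_phoneme = math.log2(PHONEME_COUNT)

# Speaking rate implied by the quoted information rate.
implied_phonemes_per_sec = RATE_BITS_PER_SEC / bits_per_phoneme

print(round(bits_per_phoneme, 2))          # 5.32
print(round(implied_phonemes_per_sec, 1))  # 11.3
```

Under the equal-likelihood assumption each phoneme costs a little over
5 bits, so the quoted rate corresponds to roughly 11 phonemes per second.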

Investigations  are  also  being  conducted  on  methods  which  try  to  recog- 
nize groups  of  sounds  that  are  composed  of  more  than  one  phoneme  but  are 
shorter  than  a  word.  The  use  of  pattern  correlation  matrices  operating  on 
sound  spectrum  shapes  is  the  usual  technique  involved  in  the  sound  group 
schemes. 

Recognition schemes which try to recognize entire words are at present
limited to very small libraries. These devices recognize only a few words
and then only if the speaker for which the machine has been tuned speaks
them.

5. A Speech Analysis and Synthesis Scheme for Bandwidth Compression.

The  speech  analysis  and  synthesis  scheme  investigated  in  this  paper 
is  a  data-reduction  scheme.  The  speech  signal  is  destructively  operated 
upon  such  that  a  high  percentage  of  the  redundant  data  in  the  speech  is 
removed.  The  processed  speech  is  then  presented  for  transmission  over  a 
narrow  band  communication  channel.  The  reduced  data  of  the  processed 
speech  is  used  to  control  the  speech  synthesizer  utilizing  local  excita- 
tion functions  to  reconstruct  artificial  speech  at  the  terminal  end  of 
the  system.  The  goal  of  this  data  reduction  scheme  is  to  achieve  a  band- 
width  compression  of  the  channel  necessary  to  transmit  speech  information. 

The  scheme  in  question  and  the  associated  device  break  naturally  into 
two  areas:  analysis  of  the  complete  speech  waveform  to  achieve  data  re- 
duction and  synthesis  of  artificial  speech. 

The analyzer operates on a speech waveform to extract continuously
seven low frequency coded signals as a function of time. These coded
signals, which shall be called control signals, are a measure of seven
parameters of the complete speech wave. It is the variation of these
seven parameters that is important. Variations in the parameters are
caused by changes in the articulation mechanism of a speaker and since
these articulation changes are restricted to low frequency syllabic rates
the channel width required for transmission of each of the seven
parameters is approximately 20 cps [2, 17].
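The total channel width follows directly from these two figures. The
short sketch below only multiplies them out; it assumes that the seven
control signals are carried side by side with no allowance for guard
bands, and the function name is introduced here for illustration.

```python
def total_control_bandwidth(num_signals, cps_per_signal):
    """Total width when every control signal gets its own slot."""
    return num_signals * cps_per_signal


print(total_control_bandwidth(7, 20))  # 140
```

The result agrees with the total of approximately 140 cps quoted in the
abstract for the seven control signals.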

The  major  increase  in  efficiency  comes  from  sending  not  the  complete 
speech  waveform  which  is  complex  but  only  information  to  control  local 
excitation  functions  at  the  synthesizer.  The  data  transmitted  consists  of 
how  the  speech  is  varying  and  is  not  speech  itself. 

The synthesizer, using the incoming control signals to modulate local
excitation functions similar in general to the physical sources
producing the speech, reconstructs a representation of the analyzed speech,
thus producing artificial speech.

The functional block diagram for the speech analyzer is shown in
Figure 11. From this diagram it is seen that the seven control signals
extracted from the complete speech wave may be divided into two basic
types. Three control signals consist of amplitude information; four
consist of frequency information.

The  scheme  extracts  frequency  and  amplitude  information  from  the 
same  regions  in  the  speech  spectra  with  the  note  that  amplitude  informa- 
tion for  the  pitch  channel  is  not  extracted  from  the  frequency  region 
normally  associated  with  pitch.  For  example,  observe  that  both  amplitude 
and  frequency  information  are  extracted  from  the  region  3000-6000  cps. 

Investigations carried out in an allied speech-processing area by
W. C. Dersch at IBM, data yet unpublished, tend to indicate that the
optimum area to extract amplitude information may not necessarily be the
same as the area of extraction of frequency information. Nor does the
number of frequency extractors have to be the same as the number of
amplitude extractors. Further extensive investigation is required to
optimize the number and placement of the frequency and amplitude
extractors in the voice spectrum.

The incoming speech waveform upon entering the analyzer is separated
by fixed filters into three frequency bands: 300-1500 cps, 1500-3000 cps,
and 3000-6000 cps. The outputs of the various filters are sent to the
frequency and amplitude extractors associated with that particular channel.
The output of the 300-1500 cps filter is also sent to the pitch extractor
circuit.

Figure 11. Functional block diagram of the speech analyzer showing the
development of the seven control signals to be transmitted to the speech
synthesizer.

The function of the amplitude extractors is to derive an indication
of the energy present in the various major bands as a function of time.
A graphical presentation of the output of the amplitude extractors is
also shown in Figure 12. The amplitude extractor takes the signal
emanating from its associated fixed filter, envelope demodulates it,
smooths the resulting waveform, and filters the output to allow only
variations of approximately 20 cps or below. The circuitry to accomplish
these functions is shown in Section 6. Due to the smoothing action of the
demodulator and filter the resulting control signal cannot be said to be
an absolute instantaneous measurement of the energy in the band. It is a
very close approximation.
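The action of an amplitude extractor can be imitated on sampled data. The
sketch below is a digital stand-in and not the circuit of Section 6: it
assumes full-wave rectification as the envelope demodulator and merges the
smoothing and the 20 cps filtering into a single one-pole low-pass stage.
The function name, the sampling rate, and the 1000 cps test tone are
illustrative.

```python
import math


def amplitude_extractor(samples, sample_rate, cutoff_cps=20.0):
    """Full-wave rectify, then smooth with a one-pole low-pass filter."""
    # Smoothing coefficient giving roughly the requested cutoff.
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_cps / sample_rate)
    level = 0.0
    control = []
    for x in samples:
        level += alpha * (abs(x) - level)  # rectify, then smooth
        control.append(level)
    return control


# One second of a hypothetical 1000 cps tone of unit amplitude.
rate = 16000
tone = [math.sin(2 * math.pi * 1000 * n / rate) for n in range(rate)]
control = amplitude_extractor(tone, rate)
print(round(control[-1], 1))  # 0.6
```

The control signal settles near the average of the rectified wave and
scales in proportion to the input amplitude. Because of the smoothing it
lags sudden changes in level, which is why the text calls it a close
approximation and not an instantaneous measurement.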

A complex waveform, over a given time interval, may be completely
specified with a Fourier series. Information theory has shown that a
waveform may be completely specified with the correct number of discrete
samples during a given time. An approximation to a waveform, admittedly
a gross one, is obtained if one specifies only the axis crossings, or
zero crossings, of the waveform and assumes that the waveform is
sinusoidal in nature between the zero crossings. This approach is, of
course, the clipped speech approach as discussed by Licklider. Consider
the extreme destruction performed on the speech wave when only the zero
crossings of the wave are transmitted. The surprisingly high
intelligibility resulting when tilting and differentiation are performed
prior to clipping is indeed factual evidence of the great redundancy of
speech and the small amount of information that must be presented to the
human sensor for auditory recognition. A communication system, Frena, has
been developed in which the zero crossings and envelope of the speech
waveform are transmitted. This device uses the resulting data reduction
to obtain an increased signal-to-noise ratio rather than bandwidth
compression.

Figure 12. Typical output waveform of Amplitude Information

This investigation has taken the approach that another type of waveform
approximation is obtained if slope reversal information on a complex
wave is utilized. The function of the frequency extractors is to derive
from the complex wave at the output of the fixed filters a measure of the
average frequency of the complex wave over a delta interval. This measure
is defined as a short term average of the slope reversals of the complex
wave.

The frequency extractors develop a pulse of given width each time the
input wave reverses slope. An integrator operates upon the incoming pulse
stream and produces a varying DC voltage which is a measure of the short
time average of the slope reversals. Since the components that produce a
change in the slope reversal rate are limited to syllabic rates, the
output of the frequency extractors will possess variations of the order
of 20 cps. The control voltage produced by the frequency extractors, as
has been stated, is a measure of the average frequency of the input
waveform over a delta interval. The integration time of the frequency
extractors is approximately 50 msec. Note that the slope reversal
information is obtained from the output of the fixed filters and not from
the complete speech waveform. Figure 13 shows a graphical presentation of
the output of the frequency extractors.
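The pulse-and-integrate action of the frequency extractor can be sketched
as follows. This is a minimal illustration under stated assumptions, not
the circuit of Section 6: the input is assumed to be sampled, a pulse is
triggered on each negative-to-positive slope reversal (the convention
given for the detector in Section 6), the pulse width of 0.1 msec is a
hypothetical value, and a leaky integrator with a 50 msec time constant
stands in for the analogue integrator. All function names are
illustrative.

```python
import math


def slope_reversal_indices(samples):
    """Indices where the slope changes from negative to positive."""
    found = []
    for n in range(1, len(samples) - 1):
        falling = samples[n] - samples[n - 1] < 0.0
        rising = samples[n + 1] - samples[n] >= 0.0
        if falling and rising:
            found.append(n)
    return found


def frequency_extractor(samples, sample_rate_hz,
                        pulse_width_s=1e-4, integration_time_s=0.05):
    """Return a control signal that rises with the slope reversal rate."""
    pulse_len = max(1, round(pulse_width_s * sample_rate_hz))
    pulses = [0.0] * len(samples)
    for n in slope_reversal_indices(samples):
        # Fixed-width pulse per reversal (the monostable multivibrator).
        for k in range(n, min(n + pulse_len, len(samples))):
            pulses[k] = 1.0
    # Leaky integrator: short-term average of the pulse train.
    alpha = 1.0 - math.exp(-1.0 / (integration_time_s * sample_rate_hz))
    state = 0.0
    out = []
    for p in pulses:
        state += alpha * (p - state)
        out.append(state)
    return out
```

Because every pulse has the same width, the averaged output is
proportional to the number of reversals per second, so doubling the
frequency of a steady tone roughly doubles the control voltage.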

Pitch frequency information is extracted from the frequency band
300-1500 cps. This is a radical change from the usual method of pitch
frequency information extraction. The usual approach has been to use a
band pass filter in the region from 100 to 200 cps to extract the
fundamental of the Fourier series of the speech waveform and call this the
pitch frequency.

The frequency corresponding to the pitch of the male voice is in general
below 200 cps. For female voices it may range as high as 500 cps. An
interesting phenomenon is that the human sensor perceives pitch regardless
of whether a frequency corresponding to the pitch is present in the speech
spectrum or not. Consider the telephone. Frequencies below 300 cps are
not passed by the system. Yet the listener hears pitch. Spectral analysis
of speech waveforms has shown that very often the frequency corresponding
to the pitch is not present in the speech spectrum, or is present only to
a very diminished degree [7]. Partially deaf persons who are deaf to all
frequencies below 1000 cps can still distinguish pitch in voice
conversation.

The view of pitch taken in this investigation is based upon the theory of
the residue, which shall be discussed.

Consider first the inadequacy of the system which tries to extract pitch
by filtering out the fundamental of the Fourier series, which at various
times is not even present in the voice spectrum. No amount of filtering
is going to extract a frequency that is not present. A cognizance of this
problem has resulted in fundamental "finders" which are complicated and
often not much more proficient than the approach of finding the
fundamental by filtering [18, 19, 20]. These "finders" in general attempt
to track two harmonics in the speech spectrum and from these harmonics
obtain a beat frequency corresponding to the fundamental. Unfortunately,
sometimes the particular harmonics being tracked absent themselves from
the spectrum.

In general the frequency corresponding to the pitch as perceived by the
human sensor is the fundamental of the Fourier series [7]. But sometimes
it is not.

Considering the illustrations of the telephone, spectral analysis, and
partially deaf persons, by what means does the human sensor perceive
pitch when the acoustic stimulus does not contain a frequency
corresponding to the pitch? The residue theory contends that a collective
observation of the higher harmonics of the speech spectra results in the
perception of a sharp sound, this sound component being called the
residue. The collective vibration form of these harmonics is periodic in
nature. The periodicity of the collective waveform, which is very
apparent in the speech waveforms, corresponds frequencywise to the
frequency of the residue. The periodicity of the collective waveform and
the frequency of the residue correspond almost all the time to the
fundamental of the speech spectrum. In the remaining cases the waveform
periodicity and residue frequency correspond to lower harmonic
frequencies; i.e., second or third. In all cases, the frequency of the
pitch perceived by the human sensor is the residue frequency.

Based upon the residue theory, the method utilized in this investigation
to determine a measure of the pitch frequency is as follows. The pitch
extractor monitors the output of the lowest frequency band fixed filter;
that is, 300 to 1500 cps. During voiced speech the collective waveform of
the harmonics in the band 300 to 1500 cps is periodic. The pitch
extractor develops a sinusoidal waveform whose frequency corresponds to
the periodicity of the speech waveform in this band. It has been found
unnecessary to observe the complete speech spectrum, the periodicity of
the unfiltered speech waveform being the same as the periodicity in the
band from 300-1500 cps. The pitch extractor is composed of an envelope
demodulator and a low pass filter network. The circuitry is shown in
Section 6.

The output of the pitch extractor is sent to a frequency extractor circuit
which develops a control voltage which is a measure of the average
frequency of the sinusoidal waveform at the output of the pitch extractor
over a delta interval. Figure 14 is a series of photographs of the
waveform at the output of the 300-1500 cps fixed filter, and the resulting
sinusoidal wave at the output of the pitch extractor, for three voiced
sounds.
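The residue idea can be illustrated numerically: a band containing only
the 4th, 5th, and 6th harmonics of 100 cps has no 100 cps component, yet
its envelope repeats 100 times per second. The sketch below is an
illustration under stated assumptions and not the circuit of the
investigation: half-wave rectification serves as the envelope
demodulator, four cascaded one-pole smoothers with a 120 cps cutoff stand
in for the low pass filter network, and the frequency of the resulting
wave is read by counting upward crossings of its mean. The names, cutoff,
and number of stages are hypothetical.

```python
import math


def lowpass(samples, cutoff_hz, sample_rate_hz, stages):
    """Cascade of one-pole smoothers (stand-in for the low pass network)."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate_hz)
    for _ in range(stages):
        state = 0.0
        out = []
        for x in samples:
            state += alpha * (x - state)
            out.append(state)
        samples = out
    return samples


def pitch_extractor(band_signal, sample_rate_hz, cutoff_hz=120.0, stages=4):
    """Envelope-demodulate the band signal and smooth it to a pitch wave."""
    rectified = [max(x, 0.0) for x in band_signal]
    return lowpass(rectified, cutoff_hz, sample_rate_hz, stages)


def estimate_frequency(wave, sample_rate_hz):
    """Count upward crossings of the mean, expressed per second."""
    mean = sum(wave) / len(wave)
    crossings = 0
    for prev, cur in zip(wave, wave[1:]):
        if prev < mean <= cur:
            crossings += 1
    return crossings * sample_rate_hz / len(wave)
```

Applied to one second of the three-harmonic signal, the estimate comes out
close to 100 cps even though no energy is present at that frequency.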

The functional block diagram for the speech synthesizer is shown in
Figure 15. The function of the speech synthesizer is to utilize the seven
incoming control signals to continuously synthesize speech.

Figure 14. Output of fixed bandpass filter, 300 to 1500 cps, and
corresponding output of Pitch Extractor for three voiced sound inputs:
EE, AH, and OH. [Photographs: fixed filter output and pitch extractor
output for each sound.]

Figure 15. Functional block diagram of the speech synthesizer showing
seven input control signals and speaker output of artificial speech.
[Block diagram: in each of the four channels (3000-6000, 1500-3000,
300-1500, and 100-200 cps) a voice characterized, band limited sound
generator feeds a voltage variable filter followed by an amplitude
modulator; the four modulator outputs are combined in a mixer.]

The frequency information control signals operate to select the position
of the passband in four voltage variable filters. The action of the
voltage variable filter has been quantized. For example, when the
frequency control signal for the sub-band 300-1500 cps varies continuously
from a voltage that corresponds to 300 cps to a voltage that corresponds
to 1500 cps, the center frequency of the passband of the associated
voltage variable filter does not move continuously from 300 to 1500 cps
but moves discretely in a series of seven steps. Thus, the center
frequency of the filter remains at 300 cps for control signal values
corresponding to frequencies of 300 to 400 cps. At 400 cps the passband
center shifts to 500 cps and remains there until the control signal
reaches a value corresponding to 600 cps. This procedure is followed in
all of the voltage variable filters. The filter shifts from one center
frequency to the next at a frequency which is midway between the quantized
filter center positions. Table 1 lists the quantized center frequency
positions of the voltage variable filters. The passbands of the four
filters are: 20 cps for the sub-band 100 to 200 cps; 200 cps for the
sub-bands 300 to 1500 cps and 1500 to 3000 cps; and 300 cps for the
sub-band 3000-6000 cps. Different passbands were used in the various
sub-bands for two reasons. First, the type of information connected with
the lower sub-band is narrower than the type present in the upper
sub-band. The lower sub-band is concerned with pitch information; the
upper sub-band involves mainly wide band fricative information. Second,
the ability of the ear to differentiate between frequencies becomes
poorer as frequency increases.
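The quantizing rule for the 300-1500 cps sub-band can be written out as a
short sketch. The seven center positions used here (300, 500, ..., 1500
cps) are inferred from the example in the text, in which the center moves
from 300 to 500 cps when the control signal passes 400 cps; the positions
for the other sub-bands are those of Table 1 and are not assumed here. The
function name and the clamping of out-of-range control values are
illustrative choices.

```python
# Center positions for the 300-1500 cps sub-band, inferred from the text.
CENTERS_300_1500 = [300.0, 500.0, 700.0, 900.0, 1100.0, 1300.0, 1500.0]


def quantized_center(control_hz, centers):
    """Map a continuous control value (in cps) to a quantized filter center.

    The filter shifts to the next center when the control value reaches the
    midpoint between two adjacent centers.
    """
    clamped = min(max(control_hz, centers[0]), centers[-1])
    chosen = centers[0]
    for lower, upper in zip(centers, centers[1:]):
        if clamped >= (lower + upper) / 2.0:
            chosen = upper
    return chosen
```

With these centers a control value corresponding to 350 cps leaves the
filter at 300 cps, a value of 400 cps moves it to 500 cps, and a value of
600 cps moves it to 700 cps.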

Each voltage variable filter filters the output of its own associated
unique sound generator, the filter position being determined by the
corresponding frequency control signal. A survey of the literature has
shown that in other speech synthesis schemes functionally comparable sound
generators are almost always in the form of buzz and hiss generators or
oscillators. The operation of these devices is well understood. Here,
instead of presenting to the filter the band limited white noise of hiss
generators or the harmonically rich output of the buzz generators and
oscillators, the approach taken is to present to the filter band limited,
voice characterized sound. The actual implementation of the sound
generators may proceed along a number of approaches. Tracks on a magnetic
drum may be utilized. A single continuous groove on a phonograph record
may be used. The method used in the investigation was to pass a single
continuous loop of magnetic tape through a tape recording device. There
were four tracks on the tape. Each of the four tracks is associated with a
major frequency band in the synthesis scheme. That is, one track is
associated with the band 3000 to 6000 cps, one with the band 1500 to 3000
cps, one with the band 300 to 1500 cps and one with the band 100 to 200
cps. The sound on each track is the result of a person or groups of
persons speaking through a fixed bandpass filter whose limits correspond
to the frequencies mentioned just above. Recording is done at an
unsaturated level. After several cycles a track on the continuous tape
loop is over recorded many times and is thus saturated with band limited,
voice characterized sound. This sound is not pure noise, but is sound
which has the speech characteristics of the selected channel. The sound
generator thus possesses characteristics of the human voice production
device.

The outputs of the sound generators are one of the two inputs to the
voltage variable filters. The other input to each of the variable filters
is the frequency control signal associated with that channel.

The use of these sound generators is indeed empirical. It is a hypothesis
of this scheme that the use of band limited, voice characterized sound
will lead to increased intelligibility and naturalness in the synthesized
speech. Research on speech sounds themselves has shown that the use of
superposed samples results in a sound which displays the average spectral
properties of speech more readily than the methods that have been
employed.

The output of each of the variable filters is amplitude modulated by its
associated amplitude control signal.

It will be recalled that the frequency corresponding to the pitch was
determined by observations on waveform periodicity in the lower frequency
band channel. This frequency, in general, for male voices is between 100
and 200 cps, so that while there is no analysis done on speech in the 100
to 200 cps region there must be a sound generator and variable filter in
the synthesizer for this region in order to synthesize the pitch sound.
The output of the pitch channel variable filter is amplitude modulated by
the amplitude control signal of the 300 to 1500 cps channel. This same
amplitude control also modulates the output of the variable filter
associated with the 300 to 1500 cps channel. This illustrates the concept
discussed earlier that the frequency and amplitude channels need not
necessarily cover the same range in the voice spectrum.

The outputs of the modulators are resistively mixed, amplified, and
passed to the output speaker, the synthesis of artificial speech being
complete.
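One synthesizer channel (sound source, variable bandpass filter, amplitude
modulator) and the final mixer can be sketched as below. This is an
illustration under stated assumptions only: the voltage variable filter is
replaced by a common two-pole digital bandpass section with unity gain at
its center frequency, the source and control signals are assumed to be
sampled lists of equal length, and all function names are hypothetical.
The sound source itself (the tape loop of band limited, voice
characterized sound) is not modelled; any sampled source may be supplied.

```python
import math


def bandpass(samples, center_hz, bandwidth_hz, sample_rate_hz):
    """Two-pole digital bandpass section with unity gain at center_hz."""
    w0 = 2.0 * math.pi * center_hz / sample_rate_hz
    q = center_hz / bandwidth_hz
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b0, b2 = alpha / a0, -alpha / a0
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


def synthesize_channel(source, center_hz, bandwidth_hz, amplitude,
                       sample_rate_hz):
    """Filter the channel's sound source, then amplitude modulate it."""
    filtered = bandpass(source, center_hz, bandwidth_hz, sample_rate_hz)
    return [a * y for a, y in zip(amplitude, filtered)]


def mix(channels):
    """Sum the modulator outputs sample by sample."""
    return [sum(values) for values in zip(*channels)]
```

In this sketch the pitch channel would be driven by the amplitude control
of the 300 to 1500 cps channel, as described above, simply by passing the
same amplitude list to both calls of synthesize_channel.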

The system block diagram is shown in Figure 16.

Figures 17-20 are photographs of oscilloscope presentations at various
points throughout the system for the word "six". Figure 17 shows the
input waveform to the system and the synthesized output waveform. Figure
18 shows the output of the three fixed analyzer filters. Figure 19 shows
the four associated frequency control signals. The corresponding
amplitude control signals are shown in Figure 20.

Figure 17. Top: Audio output waveform of system synthesizer for input
word "six". Bottom: Audio input waveform to system for word "six".
Time scale: 100 msec/cm.

Figure 18. Output waveforms of analyzer fixed filters for word "six".
Top, 3000 to 6000 cps band. Middle, 1500 to 3000 cps band. Bottom, 300
to 1500 cps band. Time scale: 100 msec/cm.

Figure 19. Frequency control signals for word "six". From top to bottom:

1. 300 to 1500 cps band.

2. Pitch control signal.

3. 1500 to 3000 cps band.

4. 3000 to 6000 cps band.

Figure 20. Amplitude control signals for word "six". Top to bottom:

1. 3000 to 6000 cps band.

2. 1500 to 3000 cps band.

3. 300 to 1500 cps band.

Note in the 3000 to 6000 cps band the buildup of energy during the "s"
sounds and the drop-off during the voiced "i" sound. In the 300 to 1500
cps band observe the lack of energy in the band for all sounds except the
voiced "i" sound.

6. Implementation of Speech Processing Scheme

The design level set during the investigation was based upon three
philosophies. First, a degree of looseness is permitted and normal for
the investigation and demonstration of a concept at the laboratory level.
Second, the gap between the laboratory device and a functionally
equivalent commercial product should be kept at a minimum and be easily
traversed by simple product engineering. Third, when an element normally
not associated with a given function is utilized, intensive design
research and a more tightly engineered component are demanded in order to
evaluate it both from a device and a system standpoint.

The third philosophy characterized the voltage variable bandpass filter
utilized in the speech processing system. The requirements set for this
device were found to be higher than those currently being observed by
investigators in closely allied speech processing research. The voltage
variable filter is considered to be a key element in the system and as
such had greater demands placed upon it. Much consideration was given
during the design stage to the possibility that an inverse relationship
might exist between system intelligibility and filter performance.
Because of the critical nature of the bandpass filter, a great deal of
effort and time was spent in the choice of a circuit and its development.
As a result, the treatment of the voltage variable filter is far more
extensive than for other system components.

The design and construction of the various functional components of the
speech analysis and synthesis system were, for the main part,
straightforward. The finalized circuits for the more straightforward
components shall be presented and discussed only briefly. A more
intensive discussion will be presented for those components which posed a
more serious problem.

Referring to Fig. 16, we see that the speech waveform, after passing
through the microphone, is passed through a voltage amplifier. This
voltage amplifier is of standard design. The output of the voltage
amplifier is then sent to three SKL Model 302 filters. The bandwidth and
center frequency of each of the pass bands may be varied by manual
adjustment.

The circuitry for the amplitude information extractor is shown in Fig.
21. It consists of a standard envelope demodulator, a half-wave
rectifier, followed by three low-pass filters. The low-pass filters
perform two functions. They smooth the waveform and permit only
variations of 20 cps or below. The output is from an emitter follower.

Fig. 22 shows the circuitry of the frequency information extractor. The
frequency extractor is composed of three sections: a slope reversal
detector, a monostable multivibrator, and an integrator. The slope
reversal detector develops a trigger pulse for the multivibrator each
time the input wave reverses slope from negative to positive. The
multivibrator emits a train of constant width pulses which are
short-term averaged by the integrator. The integration time is
approximately 50 milliseconds. The circuitry shown in Fig. 22 is for the
frequency band from 1500 to 3000 cps. The RC time constants of the
multivibrator and integrator must be varied slightly to accommodate the
other major sub-bands. A picture of the four frequency information
extractors is shown in Fig. 23.

The pitch extractor shown in Fig. 24 consists of an envelope demodulator
followed by two constant k low-pass filters. The function of the pitch
extractor is to develop a sinusoidal wave whose frequency corresponds to
the periodicity or pitch frequency of the speech wave for presentation to
a frequency information extractor.

Figure 23. A photograph of the four modularized Frequency Information
Extractors. The chassis contains the Pitch Extractor. The plug in devices
on the front provide transistor power supply terminals, signal
input-output terminals for all units, and test point terminals for access
to three test points per module.

Fig. 16 shows the block diagram for the speech synthesis scheme. The noise
generator has been previously discussed. The modulator circuitry for the
synthesizer is shown in Fig. 25. Here the outputs of the various voltage
variable filters are amplitude modulated by the control signals derived
from the amplitude information extractors. The outputs of the modulators
are resistively mixed, passed through a stage of voltage amplification
and a stage of power amplification to the output speaker.

The development of a voltage variable bandpass filter for use in the
audio-frequency range poses a serious problem with stringent restraints.
First of all, the passband must be essentially constant for a center
frequency variation of nearly 20 to 1. Secondly, the amplitude of the
passed signal for a constant amplitude input wave must remain constant
for the same 20 to 1 center frequency variation, namely 300 to 6000 cps.

Prior considerations as to the passbands for the filters have led to the
requirements of a 200 cps bandwidth at the half-power points for the
sub-band 300 to 1500 cps; a similar 200 cps bandpass for the sub-band
1500 to 3000 cps; and a 300 cps bandpass for the sub-band 3000 to 6000
cps. A much narrower 20 cps bandpass filter is needed in the pitch
information channel.

With a view toward evaluating the stated concept of the speech analysis
and synthesis bandwidth compression scheme and at the same time
developing devices which could be part of a workable, non-laboratory
model, it is felt that any proposed filter must be judged on the criteria
of size, economy, weight, and simplicity.

Consider first the use of standard LC filters in a T arrangement. The use
of this type of filter appears unprofitable from several points of view.
The size of the required inductances for use in the audio region is
prohibitive. The shift of the passband as a function of some control
voltage requires that either the L or C components of the filter be varied
continuously or in discrete steps. Variation of the L components, using
Increductors [22], to shift the passband requires sizeable auxiliary
circuits. Voltage variable capacitors are commercially available at the
present time, but their intrinsic capacitance and their dynamic range are
as yet far too small to be of practical use in the audio region. The Q of
the inductances varies with frequency; thus, the passband itself would
also be a function of frequency. Also, any variation of the components to
shift the passband would result in a change in characteristic impedance of
the filter, so that for proper operation of the filter the terminating
impedance would also have to be varied.

The tuned circuit provides another means by which filtering may be
accomplished. Simplicity is the prime advantage of the tuned circuit
filter. Here again, the high LC product required for operation in the
audio region presents serious disadvantages with reference to required
size and availability of suitable voltage varying components. But it is
the very nature of operation of the network itself that prevents
utilization of the tuned circuit in the bandwidth compression scheme.
Consider the requirements for the bandpass filter. First, the bandwidth
must remain constant over a wide range of frequencies. Second, the
amplitude of the passed signal must not vary with frequency. The Q of the
resonance curve determines the bandwidth of the filter. That is,

$$Q = \frac{f_0}{\Delta f}$$

For a constant bandwidth $\Delta f$ this requires, for instance, in the
sub-band from 3000 to 6000 cps a Q which varies directly with frequency.
At 3000 cps, for a bandwidth of 300 cps, the required Q is 10; at 6000
cps, a Q of 20. As the frequency increases, so must Q.

The following circuit is a simple tuned circuit bandpass amplifier where
the passband is shifted by means of Increductors.

The equivalent circuit of this tuned amplifier is:

This circuit may be redrawn as follows.

The resonant frequency of this circuit is $\omega_0 = 1/\sqrt{LC}$.
Substituting for L in the expression for the Q of the circuit, it is seen
that as the passband is shifted by varying L, the Q of the circuit varies
inversely with frequency. That is, as the frequency increases, Q
decreases. This is exactly opposite to the required performance of the
filter. Therefore, the use of a tuned filter is not possible in this case
with the stated specifications. Also consider that, if the amplitude of
the passed signal is to remain constant throughout the sub-band, the
impedance of the circuit as seen by the current generator must remain
constant.

Neglecting the shunt capacitances, the load for the current generator is
made up of the resistance R and the tank impedance $Z_t$. Now, as L is
varied to shift the passband, substituting $L = 1/(\omega_0^2 C)$ shows
that as the passband is shifted and the center frequency increases, the
magnitude of $Z_t$ decreases. R is a constant, so the magnitude of the
load for the current generator varies with the magnitude of the tank
impedance. Thus, the output amplitude varies with the center frequency,
and the second requirement of the filter is not met.
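The two objections to the tuned circuit can be checked numerically. The
sketch below is an illustration under a stated modelling assumption, not a
reproduction of the equivalent circuit of the investigation: the tank is
taken to have a fixed capacitance C and a coil whose loss is a fixed
series resistance r, with L varied to move the resonance. Under that
assumption $Q = \omega_0 L / r = 1/(\omega_0 C r)$ and the tank impedance
at resonance is approximately $L/(C r)$. The component values and function
names are hypothetical.

```python
import math


def required_q(center_hz, bandwidth_hz):
    """Q needed to hold a fixed bandwidth: Q = f0 / delta_f."""
    return center_hz / bandwidth_hz


def tuned_circuit(center_hz, capacitance_f, series_resistance_ohm):
    """Return (L, Q, tank impedance at resonance) when L sets the center.

    Assumes a fixed capacitor and a fixed series loss resistance in the
    coil; both are modelling assumptions made for this illustration.
    """
    w0 = 2.0 * math.pi * center_hz
    inductance = 1.0 / (w0 * w0 * capacitance_f)   # from w0^2 = 1 / (L C)
    q = w0 * inductance / series_resistance_ohm    # equals 1 / (w0 C r)
    z_tank = inductance / (capacitance_f * series_resistance_ohm)
    return inductance, q, z_tank
```

Under these assumptions, moving the center from 3000 to 6000 cps halves
the available Q, while the Q required for a constant 300 cps bandwidth
doubles from 10 to 20, and the tank impedance falls to one quarter of its
value.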

Investigations have been made using the tuned circuit as a bandpass
filter in the audio region and accepting the resulting filter
limitations [16, 17].

A very excellent continuous filter has been designed and built by
Fant [23]. Filtering is accomplished by a series of heterodyning actions
using fixed filters. The heterodyne filter provides a constant bandwidth
which is essentially independent of audio frequency. The bandwidth is
also easily modified. Fant's filter provided cutoffs with a maximal slope
of 1 db per cps and an ultimate attenuation of over 60 db. The operation
of the heterodyne filter is shown in Fig. 26. The heterodyne filter as
designed by Fant was to operate in the 45 to 4000 cps region. Therefore,
in Fig. 26, the input signal is passed through a low-pass filter,
removing components of the spectra above 4000 cps. The ultimate object of
the filter is to pass a band of frequencies $\Delta f$ from $F_L$ to
$F_H$ as shown in Fig. 26. As shown in Fig. 26b, the input signal from
the low-pass filter of Fig. 26a is heterodyned with a frequency
$f_1 = F_1 + F_L$, where $F_1$ is the fixed cutoff frequency of the
low-pass filter shown in Fig. 26b, and $F_L$ is the desired lower limit of
the ultimate passband. The resulting signal has the upper sideband removed
by the low-pass filter of Fig. 26b. The lower sideband is passed through
the bandpass filter, Fig. 26c. The signal from the bandpass filter is then
heterodyned with $f_2$, whose placement along with the cutoff frequency
$F_2$ of the low-pass filter of Fig. 26c determines the desired upper
frequency $F_H$ of the ultimate passband. The remaining band of
frequencies is then heterodyned with $f_3$ to place it in its proper
position in the frequency spectra.

The high performance and versatility of the heterodyne filter have much to
offer. Unfortunately, the complexity and size of the circuitry (Fant's
filter was an eight rack device) preclude any reasonable use in a workable
model.

The use of crystal filters and heterodyning techniques holds high
promise as an efficient means to accomplish the required filtering. High
stability, variable frequency oscillators are the prime requirement of
this approach. The concept here is to mix the band of frequencies in the
audio region to be filtered with some variable, high-frequency carrier,
pass the lower or upper sideband through a fixed crystal filter, and then
heterodyne the passed band back to the audio region. The passband shift
would be accomplished by moving the sideband relative to the fixed crystal
filter by varying the initial high-frequency carrier. This system has the
attribute of constant bandwidth and constant amplitude output for a
constant amplitude set of input audio frequencies. Variations in the desired
passband may be accomplished by two means. Crystals having the same
resonant frequency but different Q's may be picked, or a crystal having a
higher resonant frequency but a given Q may be chosen, the bandwidth being
determined by $\Delta f = f_0 / Q$. For example, a crystal having a resonant
frequency of 1 megacycle and a Q of 20,000 provides a bandpass of 50 cps,
while a crystal having the same resonant frequency but a Q of 10,000 has a
bandpass of 100 cps. Similarly, a 2 megacycle crystal with a Q of 20,000
has a bandpass of 100 cps, and a 2 megacycle crystal with a Q of 10,000 has
a 200 cps bandwidth.
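
The four crystal examples can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the bandwidth is simply the resonant frequency divided by Q; the function name `crystal_bandwidth_cps` is illustrative and does not come from the original work.

```python
def crystal_bandwidth_cps(resonant_freq_cps, q):
    """Bandwidth of a crystal filter, taken as resonant frequency / Q."""
    return resonant_freq_cps / q


# The four (resonant frequency, Q) pairs quoted in the text.
examples = [(1e6, 20_000), (1e6, 10_000), (2e6, 20_000), (2e6, 10_000)]
for f0, q in examples:
    print(f"f0 = {f0:.0f} cps, Q = {q}: bandwidth = {crystal_bandwidth_cps(f0, q):.0f} cps")
```

Halving Q or doubling the resonant frequency each double the passband, which is the pair of adjustment options described above.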

This particular approach was not followed in the investigation
because of a desire to find an equally efficient method of filtering in which
the filtering would be done in the audio region. Thus, the problems of
high stability oscillators, heterodyning, and a larger volume of circuitry
could be avoided.

RC active filters for high-pass, low-pass, and bandpass applications have
provided the basis in recent years for another type of electronically
controlled audio filter. The RC active filter has the ability to
provide characteristics corresponding to those of the usual types of RLC
passive filters. In this device, a negative impedance converter is used
in addition to passive RC elements. The sum of the capacitors in the
circuit is equal to the sum of the reactances in the corresponding RLC
filter. The normal inband loss associated with RC passive filters is
greatly reduced by the active element. The block diagram for an RC active
filter is shown below.


[Block diagram of the RC active filter: RC networks and negative impedance converter.]

The negative impedance converter is an active four-terminal network
(four-pole) which presents at the input terminal pair the negative of the
impedance connected to the output terminal pair.[24] The transfer impedance
for a lumped element filter may be written $Z_T(s) = N(s)/D(s)$, which for
the RC active filter is

$$Z_T = \frac{Z_{12a}\, Z_{12b}}{Z_{22a} - Z_{11b}}$$

where the negative sign before $Z_{11b}$ is provided by the negative impedance
converter.

The design of the circuit is basically simple. The zeros of D(s) are
chosen at the desired natural frequencies of the completed structure.
From this the driving point impedances for the structures a and b are
calculated. The structure form is selected to provide zeros of transmission
at the required frequencies; these are the zeros of N(s).

Based on the work of Linvill, Dolansky developed voltage variable
high-pass and low-pass filters which, when arranged in series, provide a
voltage variable bandpass filter. The simplified diagrams of the low-pass
and high-pass filters are shown in Fig. 27. Using the Miller effect to
provide the voltage variable capacitance, Dolansky's circuit required a
three-tube circuit per variable capacitance in the control stage. The
variable inductances are saturable inductors whose inductance depends
upon the degree of core saturation.

The audio filter as developed by Dolansky provided a cutoff slope
of 17 db per octave. The use of Increductors in the circuits, although
providing the variation required, leads to undesirable effects. Hysteresis
causes the inductance to vary about 10 percent for the same control current.
The bandpass varies with frequency because of the Q variation in the
inductance. The circuitry is sizable.

It is thus felt at this time that there are better and simpler circuits
to provide a variable filter.

The approach taken and the voltage variable filter that was designed
and built for the investigation may at first appear to be awkward and to
be the hard way of doing things. But the system was developed with the
future state of the art in mind. It is felt that within two or three
years a voltage variable capacitance will be produced, having the required
intrinsic capacitance and dynamic range, that will make the designed
system a highly efficient but simple method for variable filtering in the
audio region. It is believed that the superiority of the system that can
be attained with the use of proper voltage variable capacitors more than
offsets the circuit complexity needed at present to implement the concept
with currently available components.

The filter consists of a Twin T rejection filter in the negative
feedback path of an amplifier. The gain of the amplifier is reduced by
the negative feedback at all frequencies except the rejection frequency
of the Twin T filter. A block diagram for the system is shown below.

[Block diagram: amplifier with the Twin T filter in its negative feedback path.]

The transfer curve for the Twin T filter is

[Figure: transfer curve of the Twin T filter, showing the rejection notch.]

As is seen, the Twin T passes all frequencies except those in the notch.
Thus, there is negative feedback to the amplifier at all frequencies
except the rejection frequency. The resulting characteristic for the
system is then

[Figure: resulting characteristic of the complete system.]

The Twin T circuit consists of three resistances and three capacitances
as shown below.

[Circuit diagram: Twin T network.]
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The  problem  of  making  the  Twin  T  filter  voltage  tunable,  varying  in 
accordance with some control voltage, poses an interesting problem. In
order to shift the rejection frequency of the Twin T and thus shift the
passband of the filter, either all of the resistive elements or all of
the capacitances must be varied together. Voltage variable resistors are
available, but they are non-linear with voltage, and the problem of
maintaining a match between resistors is extremely difficult. Another
interesting scheme considered was to use a photocell as a variable resistance.
The resistance is varied by changing the light intensity incident upon
the photocell. A neon tube was considered as a possible light source.
Experiments showed the light intensity emanating from the neon tube
to be non-linear with voltage except in narrow regions. The possibility
of using a magic eye tube and intensity modulating the electron flow was
considered. This approach appears to have some merit, but was not fully
investigated, as the basic plan to use photocells as a variable resistance
proved too difficult to implement. It is very difficult to match
photocells, both dynamically and statically, to give the same resistance for
the same light intensity.

Voltage variable capacitors (Vericaps) currently available, as has
been said, do not possess the proper parameter size and range for use
in this frequency region. Currently available Vericaps have a maximum
capacitance of the order of 300 to 350 micromicrofarads. It is felt
that, when the capacitance of available Vericaps is of the order of
1000 micromicrofarads or larger, they may be used practically as voltage
variable components in the Twin T filter.

The junction capacitance of a semiconductor diode, as is well known,
is voltage variable. As the back bias to a p-n diode is varied, the
barrier width changes, and thus its capacitance. Experiments conducted
on a 1N1084 silicon diode showed a 168:1 variation in capacitance for a
back bias variation of approximately 50 volts. Unfortunately, the
non-linearity of the capacitance and the difficulty in matching diodes
precluded their use in the circuit.

The solution of the problem led to a circuit which, aside from
providing a voltage tuned filter, is unique in itself. The circuit is a
marriage of transistors, tubes, and relays. Due to the great difficulty
in obtaining, at this time, continuously voltage variable components, and
thus enjoying a continuously variable filter, it was decided to vary the
components in discrete steps and thus obtain a discrete rather than
continuous filter. It must be emphasized that the restriction to
discreteness will be removed with the expected advent of Vericaps possessing
the proper parameter size.

The method of discretely shifting the bandpass is to change the
values of all three resistive components of the Twin T together by the use
of relays. The control voltage for shifting the passband of the filter is
fed to a transistor relay control network. As the control voltage rises,
a series of relays is closed, each relay closing at a given control
voltage value as determined by the relay control network. As each relay
closes, the three resistances of the Twin T are changed. The rejection
frequency of the Twin T is changed and thus the passband of the filter is
shifted.

Consider first the relay control network as shown in Fig. 28. The
function of this circuit is to cause a set of relays to be picked up in
serial fashion. The control voltage varies from minus 20 volts to ground
potential. The control sequence is as follows: When the control voltage
is at minus 20 volts, all of the 2N441 transistors are cut off, causing all
of the relays to be open. When the control voltage rises to a less
negative potential which is equal to the negative potential of the emitter of
the 2N214 transistor associated with number one relay, the 2N214 moves from
cutoff to an operating position. This action causes a large current to flow
in the relay pickup coil, due to the current and power gain of the 2N270
and 2N441 circuitry. Relay one is thus closed. The emitters of the
various 2N214 transistors are set from left to right at progressively more
positive potentials, the individual values being at the desired control
voltage values for the closing of the relays. Thus, as the control
voltage rises from minus 20 volts to ground, the relays close in serial fashion
from left to right, each at a predetermined control voltage value. The relays
used were IW. type 104753. These relays are four-terminal-set devices,
enabling each relay to control four separate circuit elements
independently as it opens and closes. Symbolically, one of the four terminal sets
of the relay is shown below.


[Symbol: one relay terminal set, with the control arm between the normally open and normally closed contacts.]

Figure 28. Relay Control Network Showing First Two Stages

With the relay non-energized, the control arm rests in the normally closed
position. When energized, the control arm shifts to contact the normally
open position.

It must be realized that the pickup time of the relay is finite, being
of the order of 3 milliseconds. It is felt that this pickup
time is well within the demands of the system, as the control voltages will
vary at approximately a 20 cps rate.

Variations in the circuitry of the relay terminal sets allow three
different methods of changing the resistive components of the Twin T. The
resistances may be changed by adding discrete resistances in series, by
paralleling resistances, or by causing the relays to place in or take out
individual resistances. The parallel method is shown in Fig. 29. This
method is not recommended, as the resistance values associated with each
step tend to become very large and thus have a higher level of thermal
noise.

The series method is illustrated in Fig. 30 and the individual method
in Fig. 31. Both the series approach and the individual component approach
were utilized in the system in order to experimentally determine the
relative merits of the two. The series approach has the advantage of wiring
simplicity in regard to connections between the resistors of the matrix
and the terminal sets of the relays. Resistance values per terminal set
are naturally lower in value. The individual component approach was found
to be the best system. In the individual system, each position of the
bandpass may be set up and tuned without regard for any of the other setups for
other bandpass positions. In the series approach, if for a given control
voltage a different frequency for any one of the steps is desired, the
entire resistance matrix associated with each arm of the matrix must be changed.
Inasmuch as the value of the effective resistance in any arm for a given
bandpass position is extremely critical, any variation in the desired
center frequency entails an inordinate amount of labor.

It was found that the actual design and construction of the Twin T
filter tends, as various references in the literature subtly imply, to
be more of an art than a science. On this basis, the inclusion of some of
the empirical procedures determined in the construction of the filter is
deemed to be warranted in this paper.

The basic theory of the Twin T will first be investigated.[25] In
generalized form, the parameters of the Twin T, as shown below, must conform

[Circuit diagram: generalized Twin T network with parameters X1, X2, X0 and Y1, Y2, Y0.]

to the following relationship for any given rejection frequency,
(1) 


Xo  /o   - 


The  rejection  frequency  is  then  given  by 

Various degrees of sharpness in the rejection characteristic at any
frequency may be obtained by proper manipulation of the above equations.
A measure of rejection sharpness is given by equation (3).

Sharpness of rejection is indicated by lower values of A. For a symmetric
configuration, $X_1 = X_2$, $Y_1 = Y_2$, the smallest A obtainable is $A = 1$ and
occurs when $X_1 = X_0 = 1$.

By going to an unsymmetric network, smaller values of A may be obtained.
A convenient design for the unsymmetric T is

$$X_1 = Y_1 = 1, \qquad X_2 = Y_2 = k, \qquad X_0 = Y_0 = \frac{2k}{1+k}, \qquad A = \frac{1+k}{2k}$$
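
The unsymmetric design can be tabulated for a few values of k. The sketch below reads the design relations as X0 = Y0 = 2k/(1+k) and A = (1+k)/(2k); this reading is an assumption, since the scanned equations are damaged, although it does reduce to the symmetric case A = 1 at k = 1. The function name `unsymmetric_twin_t` is illustrative only.

```python
def unsymmetric_twin_t(k):
    """Normalized parameters of the unsymmetric Twin T for a chosen k.

    Assumes X1 = Y1 = 1, X2 = Y2 = k, X0 = Y0 = 2k/(1+k), A = (1+k)/(2k).
    """
    x0 = 2.0 * k / (1.0 + k)
    return {
        "X1": 1.0, "Y1": 1.0,
        "X2": float(k), "Y2": float(k),
        "X0": x0, "Y0": x0,
        "A": (1.0 + k) / (2.0 * k),
    }


for k in (1, 2, 5, 10):
    p = unsymmetric_twin_t(k)
    print(f"k = {k:2d}: X0 = Y0 = {p['X0']:.3f}, A = {p['A']:.3f}")
```

Under this reading A falls below 1 for k greater than 1 and approaches 0.5 as k grows, which agrees with the statement that the unsymmetric network gives sharper rejection than the symmetric one.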


The network that is most usually encountered is the symmetric network, for
which $X_1 = X_2 = X_0 = Y_1 = Y_2 = Y_0 = 1$. Tucker has shown that when a Twin T
which is symmetric with A = 1 is included in the negative feedback path of
an amplifier with gain equal to G, as shown below,

[Block diagram: amplifier of gain G with the Twin T in the negative feedback path.]

the Q of the system as a passband filter is $Q = G/4$.

Scott has shown that the input impedance of the Twin T is

Four voltage variable bandpass filters were constructed during the
investigation. Filter #1 covered the audio spectrum from 100 to 200 cps;
Filter #2 from 300 to 1500 cps; Filter #3 from 1500 to 3000 cps; and Filter
#4 from 3000 to 6000 cps. Table 1 shows the center frequencies and
bandpass characteristics for the various filters.


Table 1

                      Filter #1   Filter #2   Filter #3   Filter #4

Bandwidth, cps            20         200         200         300

Center Freq., cps        100         300        1700        3200
                         110         500        1900        3600
                         125         700        2100        4000
                         145         900        2300        4400
                         170        1100        2500        4800
                         200        1300        2700        5200
                                    1500        2900        5600

For discussion purposes, the development of Filter #4 will be described.


The circuitry for Filter #4 is shown below in Fig. 32a. The construction
of the Twin T and its associated relays is shown in Fig. 32b. The
circuit is seen to consist of a cathode follower, the Twin T matrix with
its associated relay control network, and a stage of amplification. If
the Twin T is set for a rejection frequency of, say, 4000 cps, then there
is no negative feedback to the grid of the cathode follower at that
frequency. A signal of 4000 cps is then permitted to exist at the cathode
of the cathode follower when the incoming signal is 4000 cps. If the input
frequency is changed, the Twin T passes this frequency and there is a
negative voltage feedback to the grid of the cathode follower. The output
of the system is thus reduced.
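
The feedback action described above can be illustrated numerically. The sketch below is not the author's circuit analysis: it assumes an ideal, unloaded, symmetric Twin T with the textbook response H = (1 - u^2) / (1 - u^2 + 4ju), where u = f/f0, and an ideal amplifier of gain G with the Twin T as its feedback network. The names `twin_t_response` and `closed_loop_gain` are illustrative.

```python
def twin_t_response(f, f0):
    """Complex response of an ideal symmetric Twin T with null at f0 (assumed model)."""
    u = f / f0
    return (1.0 - u * u) / complex(1.0 - u * u, 4.0 * u)


def closed_loop_gain(f, f0, gain):
    """Gain magnitude of an ideal amplifier with the Twin T as negative feedback."""
    beta = twin_t_response(f, f0)
    return abs(gain / (1.0 + gain * beta))


F0 = 4000.0   # rejection frequency used as the example in the text, cps
G = 74.68     # amplifier gain quoted later in the text for Filter #4
for f in (2000.0, 3800.0, 4000.0, 4200.0, 8000.0):
    print(f"{f:6.0f} cps: gain = {closed_loop_gain(f, F0, G):6.2f}")
```

At the rejection frequency the Twin T returns no feedback and the full gain G appears; away from it the feedback pulls the gain down toward unity. In this idealized model the closed-loop Q works out to (G + 1)/4, which approaches the G/4 figure attributed to Tucker when G is large.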

Originally, the positions of the cathode follower and amplifier were
interchanged. It was found that better impedance conditions and less hum
were encountered in the configuration shown. The Twin T requires a load
impedance at least three times as great as the sum of its series
resistances.

The input to the system is voice characterized, band limited sound,
band limited here to a region of 3000 to 6000 cps, obtained from the #4
sound generator. The action of the filter is to select from this 3000 to
6000 cps sub-band a smaller band 300 cps in width, the particular smaller
band chosen being determined by the control voltage. There are seven
small bands associated with this filter, the center frequencies of the
bands being given in Table 1. When all the relays are open, the small
band selected is the lowest frequency band in the sub-band. When the DC
level of the control signal corresponds to a frequency of 3400 cps, relay
#1 closes and the smaller band selected is centered at 3600 cps. As long as
the DC level of the control signal corresponds to any frequency from
3400 to 3800 cps, the small band selected remains centered at 3600 cps.
When the control signal rises to a value corresponding to 3800 cps or
higher, relay #2 closes and the small band centered at 4000 cps is
selected. When any given relay is closed, all lower numbered relays remain closed.
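
The stepwise band selection for Filter #4 can be sketched as a threshold lookup. The center frequencies are those of Table 1, and the first two relay thresholds (3400 and 3800 cps) are stated in the text; the remaining thresholds are assumed here to continue at the midpoints between adjacent centers, which the source does not confirm. All names are illustrative.

```python
from bisect import bisect_right

# Center frequencies of the seven small bands of Filter #4 (Table 1), cps.
CENTER_FREQS_CPS = [3200, 3600, 4000, 4400, 4800, 5200, 5600]

# Control-signal values (expressed as frequencies) at which relays 1..6 close.
# 3400 and 3800 come from the text; the rest are ASSUMED midpoints.
RELAY_THRESHOLDS_CPS = [3400, 3800, 4200, 4600, 5000, 5400]


def relays_closed(control_freq_cps):
    """Number of relays closed; a closed relay keeps all lower ones closed."""
    return bisect_right(RELAY_THRESHOLDS_CPS, control_freq_cps)


def selected_center_cps(control_freq_cps):
    """Center frequency of the small band selected for a given control value."""
    return CENTER_FREQS_CPS[relays_closed(control_freq_cps)]


for f in (3000, 3400, 3799, 3800, 6000):
    print(f"control = {f} cps: {relays_closed(f)} relay(s) closed, "
          f"band centered at {selected_center_cps(f)} cps")
```

Counting the thresholds at or below the control value reproduces the serial pickup behavior: every relay below the highest one closed is also closed.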

Let us consider the design of the filter. First assume that the Twin
T is completely symmetric, i.e., $X_1 = X_2 = X_0 = Y_1 = Y_2 = Y_0 = 1$. The gain for
the amplifier is found from $Q = G/4$. Here there appears to be a
contradiction. As has been stated, the Q of this circuit must vary linearly
with frequency to maintain a constant bandpass. If the Twin T is
completely symmetric at all center frequencies, then the Q of the filter is
equal to $G/4$, and thus for a flat amplifier the Q would remain constant
over the passband. It will be shown that by modification of the Twin T,
the filter will have a Q that varies linearly with frequency.

Select the highest Q needed in the sub-band. For the #4 filter, the
maximum $Q = f/\Delta f = 5600\ \text{cps} / 300\ \text{cps} = 18.67$. Thus the required gain of the
amplifier is $G = 4Q = 74.68$. The required gain is obtained from the amplifier
stage with a 15K resistor in the plate circuit.
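
The Q and gain figures can be reproduced, and extended to the other bands of Filter #4, with the short sketch below. It assumes only Q = f/Δf and G = 4Q as used in the text; the function names are illustrative. Computed without intermediate rounding the gain comes to about 74.67, and the 74.68 quoted above results from rounding Q to 18.67 before multiplying.

```python
def required_q(center_cps, bandwidth_cps):
    """Q needed for a given center frequency and passband width."""
    return center_cps / bandwidth_cps


def amplifier_gain(q):
    """Amplifier gain for a symmetric Twin T feedback filter, using Q = G/4."""
    return 4.0 * q


BANDWIDTH_CPS = 300
for center in (3200, 3600, 4000, 4400, 4800, 5200, 5600):
    q = required_q(center, BANDWIDTH_CPS)
    print(f"{center} cps: Q = {q:5.2f}, G = {amplifier_gain(q):5.2f}")
```

The listing shows Q rising in proportion to center frequency for a fixed 300 cps passband, which is the linear variation of Q with frequency that the text requires.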

The values of the resistances and capacitors in the Twin T must now
be chosen. Components must be picked subject to two constraints. The
resistances of the Twin T must be of such a value that, at any rejection
frequency, the input resistance is large compared to the cathode follower
resistance, and the output resistance is smaller than one-third the size of
the input impedance to the amplifier stage. The capacitor sizes should be
large enough to swamp wiring capacitance and amplifier input capacitance.
It must be stressed that the Twin T is very finely balanced, and any change
in the effective component values causes wide deviation from the desired
operation.

The  design  equations  for  the  Twin  T  components  are: 

$$f_{\text{rejection}} = \frac{1}{2\pi R_1 C_1}$$

For $f = 5600$ cps and $C_1 = 500\ \mu\mu\text{f}$,

$$R_1 = 56.9\text{K}, \qquad R_3 = 28.45\text{K}$$

Experimentally, it has been found that the Q of the filter with the values
shown above is lower than the desired design value. To
obtain the desired Q, reduce $R_3$ in the Twin T to approximately one-half of
its design value. This changes the rejection frequency by a proportion
which is of the same order as the Q. If $R_1$ and $R_2$ are now increased
slightly, the rejection frequency returns to the desired value. The amplitude of
the output must be of some desired level, and minor modifications in $R_1$,
$R_2$, and $R_3$ will allow this requirement to be met. For comparison, the
designed and actual circuit values are shown below:

                  Design        Actual

R1 = R2           56.9K         62.0K

R3                28.45K        16K

C1 = C2           500 μμf       500 μμf
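
The design values can be checked against the usual symmetric Twin T relations. The sketch below assumes f = 1/(2πR1C1) with R3 = R1/2; these relations agree with the design figures quoted above but are stated here as an assumption, since the printed design equations did not survive the scan. The function name is illustrative.

```python
import math


def twin_t_design_resistors(f_reject_cps, c1_farads):
    """Return (R1, R3) in ohms for a symmetric Twin T, assuming
    f = 1 / (2*pi*R1*C1), R1 = R2, and R3 = R1 / 2."""
    r1 = 1.0 / (2.0 * math.pi * f_reject_cps * c1_farads)
    return r1, r1 / 2.0


# Filter #4, highest band: 5600 cps with 500 micromicrofarad capacitors.
r1, r3 = twin_t_design_resistors(5600.0, 500e-12)
print(f"R1 = R2 = {r1 / 1e3:.2f}K, R3 = {r3 / 1e3:.2f}K")
```

The computed values fall within a fraction of a percent of the 56.9K and 28.45K design figures. The much larger gap between the design and actual values of R3 reflects the empirical trimming described in the text, not the formula.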

The  filter  is  then  tuned  for  the  next  rejection  frequency,  5200  cps, 
by  varying  the  components  of  the  Twin  T  to  obtain  the  desired  rejection 
frequency,  bandpass,   and  output  level. 

The remaining filters are of the same design with minor modifications.
Triodes were used at low frequencies, as the input capacitance was not as
large a problem as it was in the upper sub-bands. Various short cuts were
used in the other filters to ease the fine tuning requirements on the Twin
T. The amplitude of the output may be quickly adjusted by varying the grid
resistance of the amplifier stage. Insertion of a resistance in the
feedback loop to the grid of the cathode follower varies the Q of the system.

Resistance variation in the Twin T was chosen instead of capacitive
variation for practical reasons. This resulted in more extensive tuning
of the network, as it will be noted from equation (4) that if the R's are
varied to obtain the different rejection frequencies, the input impedance
to the Twin T varies with rejection frequency, whereas if capacitors
are used as the variable, the input impedance to the Twin T remains
constant.

The advent of Vericaps of the required size will remove the need for
the relays and transistor relay control networks. The simplicity and
performance of the resulting circuit will be excellent. The Twin T will
contain six components instead of an entire matrix of elements.
Capacitance variation will remove the necessity of juggling the system to
obtain amplitude output equality, as the Twin T input impedance will be
invariant with rejection frequency. A Q linear with frequency may be
obtained by having the series and parallel capacitors vary linearly with
control voltage but at slightly different slopes.

In passing, an important comment must be made. In operation the
action of the frequency information control signals is quantized. That
is, the control signals are continuous in nature, but the passbands of
the various filters shift in discrete steps only. The extent to which
the intelligibility of the speech processing scheme is affected by this
quantization must be determined by an additional investigation in which a
continuous system, using voltage variable capacitors of the proper size,
is developed and used for comparison. If the continuous system is not
markedly more efficient than the quantized system now being investigated,
then further bandwidth compression may be achieved by a quantization of
the control signals.

Figure 33 is a photograph of the four voltage variable filter units
and the modulator unit. Figure 34 is a photograph of the complete
laboratory set-up for the speech processing system.

Figure 33. Voltage variable filter units and modulator unit.
Top unit is modulator unit. Bottom four units are
the four voltage variable filters.

Figure 34. Laboratory set-up for speech processing system.
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7.  Conclusions  and  Recommendations. 

In retrospect it must be stated that the investigation presented in
this paper is but Phase One of a speech processing bandwidth compression
scheme development and evaluation effort. Phase One consisted of the
conceptual evolution of the system, a laboratory implementation, and a
successful feasibility demonstration. Unfortunately, time limitations on
the investigation were such that extensive quantitative results were not
obtained. Qualitative results and the performance of the system were better
than anticipated and were such that the feasibility of successfully
exchanging voice information in a highly compressed bandwidth using the given
system was definitely demonstrated.

In order to adequately describe the qualitative results obtained, three
things must first be discussed: first, the state of the system during the
testing period; second, the environment in which the testing was done;
and third, the development of a qualitative intelligibility scale for use
in adequately describing the results.

Troubleshooting of the system was far from complete when the system
was tested. Severe mismatches between elements of the system were found
to exist. Efforts to partially eliminate the mismatches resulted in vast
improvements in the intelligibility of the system. The level of
intelligibility achievable in a matched, trouble-free system is still a matter of
conjecture.

Testing  was  done  in  a  very  noisy  environment.  The  clicking  of  the  re- 
lays of  the  voltage  variable  filters  forced  conversationalists  to  raise 
their  voices  in  the  area  of  the  system  in  order  to  be  understood. 

In order to most clearly describe the intelligibility of the system,
the following scale was developed, which describes given intelligibility
levels associated with given physical environments.

Intelligibility Level    Physical Situation

        A                Quiet room, non-bandlimited speech, speaker
                         recognition.

        B                Noisy room, non-bandlimited speech, speaker
                         recognition.

        C                Quiet room, bandlimited speech, speaker
                         recognition, i.e. telephone communication.

        D                Noisy room, bandlimited speech, speaker
                         recognition.

        E                Slight speech distortion, speaker recognition,
                         no effort to recognize words.

        F                Slight distortion with noise, speaker
                         recognition, only very slight effort to
                         recognize words.

        G                Distortion and noise such that speaker is
                         not recognizable, very mild effort to
                         recognize words.

        H                Medium distortion and noise, speaker
                         non-recognition, slight effort to recognize
                         words.

        I                Distortion and noise such that measurable
                         effort is required for word recognition.

        J                Distortion and noise such that severe
                         effort is required for word recognition.

        K                Distortion and noise such that many words
                         are not recognized in connected text.

        L                Very few words recognized and then only by
                         extreme effort.

        M                Total non-recognition.

Several listeners were utilized in testing the intelligibility of the
system. The sound inputs to the system, which consisted of words, vowels,
and other sounds, were recorded on magnetic tape and played into the system
so that the listeners could only hear the output of the system. The
listeners were given no clue as to what sounds to expect. Words in context were
not used. The listeners were then asked to identify the synthesized sounds
coming from the system. The evaluation of the system showed that for
approximately 50% of the test words the intelligibility level corresponded
to level "H" above. The remainder of the test words had a level of "I".

Certain words were found to be extremely intelligible. Some of these were:
six, international, avis, nine, and corporation. These words required no
effort for recognition. The vowel sounds were found to have a higher
intelligibility level, "G". The plosive sounds averaged between levels "G"
and "H". This was better than anticipated. Inasmuch as the plosives have
a rapid onset time, it was thought that the smoothing action of the
integrators, filters, etc., would reduce their intelligibility. The fact that
they were better than anticipated is attributed to the discrete action of
the voltage variable filter. It is believed that the transients set up
when entire units of resistors are switched in and out of the filter have
onset characteristics similar to the plosive onset.

The RC time constants of the system were such that each of the seven
control signals was limited to a maximum variation rate of 20 cps. The
fastest rise time for the control signals was observed to be 20
milliseconds, which corresponds to a low pass filter characteristic with a
cutoff of 17.5 cps. For seven control signals, each with a 20 cps bandwidth,
the total bandwidth for the system is 140 cps. This is a 25:1 reduction
over the 3500 cps voice bandwidth commonly associated with SSB.
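
The bandwidth figures can be verified with simple arithmetic. The sketch below assumes the common single-pole rule of thumb that cutoff frequency is about 0.35 divided by the rise time; the text does not state which relation was used, but this one reproduces the 17.5 cps figure. The constant and function names are illustrative.

```python
NUM_CONTROL_SIGNALS = 7
CONTROL_BANDWIDTH_CPS = 20.0
SSB_VOICE_BANDWIDTH_CPS = 3500.0


def cutoff_from_rise_time(rise_time_s):
    """Approximate low-pass cutoff for a given 10-90% rise time.

    Uses the single-pole rule of thumb f_c = 0.35 / t_r (an assumption).
    """
    return 0.35 / rise_time_s


total_bandwidth = NUM_CONTROL_SIGNALS * CONTROL_BANDWIDTH_CPS
reduction = SSB_VOICE_BANDWIDTH_CPS / total_bandwidth

print(f"Cutoff for 20 ms rise time: {cutoff_from_rise_time(0.020):.1f} cps")
print(f"Total control bandwidth:    {total_bandwidth:.0f} cps")
print(f"Reduction relative to SSB:  {reduction:.0f}:1")
```

The 17.5 cps cutoff lies just under the 20 cps allotted to each control channel, so the 140 cps total is a slightly conservative figure.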

The  goal  of  system  silence  between  words  was  achieved  and  no  speaker 
recognition  was  accomplished  by  the  test  listeners. 

Further investigation of the sound generators of the synthesizer and
the pitch synthesis technique is recommended. The sound generators should
be a homogeneous source of voice characterized bandlimited sound. Two
techniques were utilized to develop a recorded source of this type of
excitation. The first technique consisted of having one speaker talk through
a bandlimited filter onto a continuous loop of magnetic tape while the loop
cycled past the write head many, many times. It was found that the linear
addition of sound on the tape hoped for was extremely difficult to achieve.
The second technique consisted of having several speakers talk through a
bandlimited filter simultaneously and recording during only one cycle of
the tape. This system was found to be far superior to the first technique.
But much investigation is still required to determine the optimum means
for implementing the sound generator concept.

It is recommended that the possibility of using a pitch oscillator,
whose frequency is controlled by the Pitch Control Signal, to synthesize
the pitch frequency be investigated. The pitch oscillator would replace
the 100 to 200 cps sound generator and the voltage variable filter
associated with the pitch channel. A system test utilizing the pitch
oscillator would determine whether the intelligibility of the system is
enhanced.
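
The recommended pitch oscillator can be illustrated with a minimal
digital sketch. Everything beyond the 100 to 200 cps range is an
assumption made for illustration: the control signal is taken to be
normalized to the range 0 to 1, the mapping to frequency is taken to be
linear, the output is a sine wave, and the 8000 samples per second rate
and the function names are hypothetical. The system described here is
analogue, so this is only a model of the concept.

```python
import math


def control_to_frequency(control, f_min_cps=100.0, f_max_cps=200.0):
    """Map a normalized control value (0..1) linearly onto the pitch range.

    ASSUMPTION: linear mapping and 0..1 normalization are illustrative.
    Values outside 0..1 are clamped to the range limits.
    """
    clamped = min(max(control, 0.0), 1.0)
    return f_min_cps + clamped * (f_max_cps - f_min_cps)


def pitch_oscillator(control_samples, sample_rate_hz=8000.0):
    """Generate one sine sample per control sample.

    The phase is accumulated sample by sample so that the output stays
    continuous while the control signal (and hence frequency) varies.
    """
    phase = 0.0
    output = []
    for control in control_samples:
        freq = control_to_frequency(control)
        # Advance the phase by the angle swept in one sample period.
        phase = (phase + 2.0 * math.pi * freq / sample_rate_hz) % (2.0 * math.pi)
        output.append(math.sin(phase))
    return output


if __name__ == "__main__":
    # One second of a control signal gliding from the bottom to the top
    # of the pitch range.
    n = 8000
    glide = [i / (n - 1) for i in range(n)]
    wave = pitch_oscillator(glide)
    print(len(wave), "samples generated")
```

Because the oscillator frequency follows the control signal directly, no
separate sound generator or voltage variable filter is needed in the
pitch channel of such a model.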

System recommendations, aside from the obvious one of matching the
various elements of the system, are optimization of:

1.  Channel frequency limits placement.

2.  Bandwidth of voltage variable filters.

3.  Relative amplitude levels of the sound generators.

Investigation is still required to determine if the channels selected
by the analyzer filter bank are optimum with respect to frequency limits
and bandwidth. Perhaps the lowest channel should not be from 300 to 1500
cps but should be from 200 to 1000 cps. The proper channel width and
frequency limits can only be determined by further intensive investigation.
Also, further investigation should be done on the possibility of extracting
amplitude and frequency information from different areas in the frequency
spectrum.

Optimization  of  the  width  of  the  bandpass  of  the  voltage  variable 
filters  is  required.  Testing  of  the  system  should  be  done  using  different 
bandwidths  to  determine  the  best  bandwidth  to  use. 

During the testing of the system it was found that better
intelligibility resulted if the amplitude levels of the sound generators
were not the same. Further research is required to determine the optimum
relative amplitude levels of the sound generators.

The  speech  processing  system  provides  an  excellent  level  of  transmission 
security  in  itself.  An  enemy  cannot  reconstruct  speech  from  the  transmitted 
control signals unless he knows the exact function of each of the seven
control signals and can duplicate the system synthesizer. Further security
can be achieved by multiplexing techniques and by time and frequency scrambling
of  the  control  signals. 